

Linear Models 
in Statistics 


Second Edition 


ALVIN C. RENCHER AND G. BRUCE SCHAALJE




THE WILEY BICENTENNIAL—KNOWLEDGE FOR GENERATIONS 


Each generation has its unique needs and aspirations. When Charles Wiley first
opened his small printing shop in lower Manhattan in 1807, it was a generation 
of boundless potential searching for an identity. And we were there, helping to 
define a new American literary tradition. Over half a century later, in the midst 
of the Second Industrial Revolution, it was a generation focused on building the 
future. Once again, we were there, supplying the critical scientific, technical, and 
engineering knowledge that helped frame the world. Throughout the 20th 
Century, and into the new millennium, nations began to reach out beyond their 
own borders and a new international community was born. Wiley was there, 
expanding its operations around the world to enable a global exchange of ideas, 
opinions, and know-how. 


For 200 years, Wiley has been an integral part of each generation’s journey, 
enabling the flow of information and understanding necessary to meet their needs 
and fulfill their aspirations. Today, bold new technologies are changing the way 
we live and learn. Wiley will be there, providing you the must-have knowledge 
you need to imagine new worlds, new possibilities, and new opportunities. 


Generations come and go, but you can always count on Wiley to provide you the 
knowledge you need, when and where you need it! 




WILLIAM J. PESCE PETER BOOTH WILEY 
PRESIDENT AND CHIEF EXECUTIVE OFFICER CHAIRMAN OF THE BOARD 


LINEAR MODELS IN 
STATISTICS 


Second Edition 


Alvin C. Rencher and G. Bruce Schaalje 


Department of Statistics, Brigham Young University, Provo, Utah 



WILEY-INTERSCIENCE 
A JOHN WILEY & SONS, INC., PUBLICATION 


Copyright © 2008 by John Wiley & Sons, Inc. All rights reserved 


Published by John Wiley & Sons, Inc., Hoboken, New Jersey 
Published simultaneously in Canada 


No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form 
or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except
as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the
prior written permission of the Publisher, or authorization through payment of the appropriate per-copy 
fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 
750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for 
permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River 
Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at
http://www.wiley.com/go/permission.


Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts 
in preparing this book, they make no representations or warranties with respect to the accuracy or 
completeness of the contents of this book and specifically disclaim any implied warranties of 
merchantability or fitness for a particular purpose. No warranty may be created or extended by sales 
representatives or written sales materials. The advice and strategies contained herein may not be 
suitable for your situation. You should consult with a professional where appropriate. Neither the 
publisher nor author shall be liable for any loss of profit or any other commercial damages, including 
but not limited to special, incidental, consequential, or other damages. 


For general information on our other products and services or for technical support, please contact our 
Customer Care Department within the United States at (800) 762-2974, outside the United States 
at (317) 572-3993 or fax (317) 572-4002. 


Wiley also publishes its books in a variety of electronic formats. Some content that appears in print
may not be available in electronic formats. For more information about Wiley products, visit our 
web site at www.wiley.com. 


Wiley Bicentennial Logo: Richard J. Pacifico 


Library of Congress Cataloging-in-Publication Data: 
Rencher, Alvin C., 1934- 
Linear models in statistics/Alvin C. Rencher, G. Bruce Schaalje. — 2nd ed. 
p. cm. 
Includes bibliographical references. 
ISBN 978-0-471-75498-5 (cloth) 
1. Linear models (Statistics) I. Schaalje, G. Bruce. II. Title. 
QA276.R425 2007 
519.5'35--dc22
2007024268 
Printed in the United States of America 


10 9 8 7 6 5 4 3 2 1


CONTENTS 


Preface xiii 


1 Introduction 1

1.1 Simple Linear Regression Model 1
1.2 Multiple Linear Regression Model 2
1.3 Analysis-of-Variance Models 3

2 Matrix Algebra 5

2.1 Matrix and Vector Notation 5
2.1.1 Matrices, Vectors, and Scalars 5
2.1.2 Matrix Equality 6
2.1.3 Transpose 7
2.1.4 Matrices of Special Form 7
2.2 Operations 9
2.2.1 Sum of Two Matrices or Two Vectors 9
2.2.2 Product of a Scalar and a Matrix 10
2.2.3 Product of Two Matrices or Two Vectors 10
2.2.4 Hadamard Product of Two Matrices or Two Vectors 16
2.3 Partitioned Matrices 16
2.4 Rank 19
2.5 Inverse 21
2.6 Positive Definite Matrices 24
2.7 Systems of Equations 28
2.8 Generalized Inverse 32
2.8.1 Definition and Properties 33
2.8.2 Generalized Inverses and Systems of Equations 36
2.9 Determinants 37
2.10 Orthogonal Vectors and Matrices 41
2.11 Trace 44
2.12 Eigenvalues and Eigenvectors 46
2.12.1 Definition 46
2.12.2 Functions of a Matrix 49
2.12.3 Products 50
2.12.4 Symmetric Matrices 51
2.12.5 Positive Definite and Semidefinite Matrices 53
2.13 Idempotent Matrices 54
2.14 Vector and Matrix Calculus 56
2.14.1 Derivatives of Functions of Vectors and Matrices 56
2.14.2 Derivatives Involving Inverse Matrices and Determinants 58
2.14.3 Maximization or Minimization of a Function of a Vector 60


3 Random Vectors and Matrices 69

3.1 Introduction 69
3.2 Means, Variances, Covariances, and Correlations 70
3.3 Mean Vectors and Covariance Matrices for Random Vectors 75
3.3.1 Mean Vectors 75
3.3.2 Covariance Matrix 75
3.3.3 Generalized Variance 77
3.3.4 Standardized Distance 77
3.4 Correlation Matrices 77
3.5 Mean Vectors and Covariance Matrices for Partitioned Random Vectors 78
3.6 Linear Functions of Random Vectors 79
3.6.1 Means 80
3.6.2 Variances and Covariances 81

4 Multivariate Normal Distribution 87

4.1 Univariate Normal Density Function 87
4.2 Multivariate Normal Density Function 88
4.3 Moment Generating Functions 90
4.4 Properties of the Multivariate Normal Distribution 92
4.5 Partial Correlation 100

5 Distribution of Quadratic Forms in y 105

5.1 Sums of Squares 105
5.2 Mean and Variance of Quadratic Forms 107
5.3 Noncentral Chi-Square Distribution 112
5.4 Noncentral $F$ and $t$ Distributions 114
5.4.1 Noncentral $F$ Distribution 114
5.4.2 Noncentral $t$ Distribution 116
5.5 Distribution of Quadratic Forms 117
5.6 Independence of Linear Forms and Quadratic Forms 119


6 Simple Linear Regression 127

6.1 The Model 127
6.2 Estimation of $\beta_0$, $\beta_1$, and $\sigma^2$ 128
6.3 Hypothesis Test and Confidence Interval for $\beta_1$ 132
6.4 Coefficient of Determination 133

7 Multiple Regression: Estimation 137

7.1 Introduction 137
7.2 The Model 137
7.3 Estimation of $\boldsymbol{\beta}$ and $\sigma^2$ 141
7.3.1 Least-Squares Estimator for $\boldsymbol{\beta}$ 145
7.3.2 Properties of the Least-Squares Estimator $\hat{\boldsymbol{\beta}}$ 141
7.3.3 An Estimator for $\sigma^2$ 149
7.4 Geometry of Least-Squares 151
7.4.1 Parameter Space, Data Space, and Prediction Space 152
7.4.2 Geometric Interpretation of the Multiple Linear Regression Model 153
7.5 The Model in Centered Form 154
7.6 Normal Model 157
7.6.1 Assumptions 157
7.6.2 Maximum Likelihood Estimators for $\boldsymbol{\beta}$ and $\sigma^2$ 158
7.6.3 Properties of $\hat{\boldsymbol{\beta}}$ and $\hat{\sigma}^2$ 159
7.7 $R^2$ in Fixed-x Regression 161
7.8 Generalized Least-Squares: $\operatorname{cov}(\mathbf{y}) = \sigma^2\mathbf{V}$ 164
7.8.1 Estimation of $\boldsymbol{\beta}$ and $\sigma^2$ when $\operatorname{cov}(\mathbf{y}) = \sigma^2\mathbf{V}$ 164
7.8.2 Misspecification of the Error Structure 167
7.9 Model Misspecification 169
7.10 Orthogonalization 174

8 Multiple Regression: Tests of Hypotheses and Confidence Intervals 185

8.1 Test of Overall Regression 185
8.2 Test on a Subset of the $\beta$ Values 189
8.3 $F$ Test in Terms of $R^2$ 196
8.4 The General Linear Hypothesis Tests for $H_0: \mathbf{C}\boldsymbol{\beta} = \mathbf{0}$ and $H_0: \mathbf{C}\boldsymbol{\beta} = \mathbf{t}$ 198
8.4.1 The Test for $H_0: \mathbf{C}\boldsymbol{\beta} = \mathbf{0}$ 198
8.4.2 The Test for $H_0: \mathbf{C}\boldsymbol{\beta} = \mathbf{t}$ 203
8.5 Tests on $\beta_j$ and $\mathbf{a}'\boldsymbol{\beta}$ 204
8.5.1 Testing One $\beta_j$ or One $\mathbf{a}'\boldsymbol{\beta}$ 204
8.5.2 Testing Several $\beta_j$ or $\mathbf{a}_i'\boldsymbol{\beta}$ Values 205
8.6 Confidence Intervals and Prediction Intervals 209
8.6.1 Confidence Region for $\boldsymbol{\beta}$ 209
8.6.2 Confidence Interval for $\beta_j$ 210
8.6.3 Confidence Interval for $\mathbf{a}'\boldsymbol{\beta}$ 211
8.6.4 Confidence Interval for $E(y)$ 211
8.6.5 Prediction Interval for a Future Observation 213
8.6.6 Confidence Interval for $\sigma^2$ 215
8.6.7 Simultaneous Intervals 215
8.7 Likelihood Ratio Tests 217


9 Multiple Regression: Model Validation and Diagnostics 227

9.1 Residuals 227
9.2 The Hat Matrix 230
9.3 Outliers 232
9.4 Influential Observations and Leverage 235

10 Multiple Regression: Random x’s 243

10.1 Multivariate Normal Regression Model 244
10.2 Estimation and Testing in Multivariate Normal Regression 245
10.3 Standardized Regression Coefficients 249
10.4 $R^2$ in Multivariate Normal Regression 254
10.5 Tests and Confidence Intervals for $R^2$ 258
10.6 Effect of Each Variable on $R^2$ 262
10.7 Prediction for Multivariate Normal or Nonnormal Data 265
10.8 Sample Partial Correlations 266


11 Multiple Regression: Bayesian Inference 277

11.1 Elements of Bayesian Statistical Inference 277
11.2 A Bayesian Multiple Linear Regression Model 279
11.2.1 A Bayesian Multiple Regression Model with a Conjugate Prior 280
11.2.2 Marginal Posterior Density of $\boldsymbol{\beta}$ 282
11.2.3 Marginal Posterior Densities of $\tau$ and $\sigma^2$ 284
11.3 Inference in Bayesian Multiple Linear Regression 285
11.3.1 Bayesian Point and Interval Estimates of Regression Coefficients 285
11.3.2 Hypothesis Tests for Regression Coefficients in Bayesian Inference 286
11.3.3 Special Cases of Inference in Bayesian Multiple Regression Models 286
11.3.4 Bayesian Point and Interval Estimation of $\sigma^2$ 287
11.4 Bayesian Inference through Markov Chain Monte Carlo Simulation 288
11.5 Posterior Predictive Inference 290

12 Analysis-of-Variance Models 295

12.1 Non-Full-Rank Models 295
12.1.1 One-Way Model 295
12.1.2 Two-Way Model 299
12.2 Estimation 301
12.2.1 Estimation of $\boldsymbol{\beta}$ 302
12.2.2 Estimable Functions of $\boldsymbol{\beta}$ 305
12.3 Estimators 309
12.3.1 Estimators of $\boldsymbol{\lambda}'\boldsymbol{\beta}$ 309
12.3.2 Estimation of $\sigma^2$ 313
12.3.3 Normal Model 314
12.4 Geometry of Least-Squares in the Overparameterized Model 316
12.5 Reparameterization 318
12.6 Side Conditions 320
12.7 Testing Hypotheses 323
12.7.1 Testable Hypotheses 323
12.7.2 Full-Reduced-Model Approach 324
12.7.3 General Linear Hypothesis 326
12.8 An Illustration of Estimation and Testing 329
12.8.1 Estimable Functions 330
12.8.2 Testing a Hypothesis 331
12.8.3 Orthogonality of Columns of X 333

13 One-Way Analysis-of-Variance: Balanced Case 339

13.1 The One-Way Model 339
13.2 Estimable Functions 340
13.3 Estimation of Parameters 341
13.3.1 Solving the Normal Equations 341
13.3.2 An Estimator for $\sigma^2$ 343
13.4 Testing the Hypothesis $H_0: \mu_1 = \mu_2 = \cdots = \mu_k$ 344
13.4.1 Full-Reduced-Model Approach 344
13.4.2 General Linear Hypothesis 348
13.5 Expected Mean Squares 351
13.5.1 Full-Reduced-Model Approach 352
13.5.2 General Linear Hypothesis 354
13.6 Contrasts 357
13.6.1 Hypothesis Test for a Contrast 357
13.6.2 Orthogonal Contrasts 358
13.6.3 Orthogonal Polynomial Contrasts 363


14 Two-Way Analysis-of-Variance: Balanced Case 377

14.1 The Two-Way Model 377
14.2 Estimable Functions 378
14.3 Estimators of $\boldsymbol{\lambda}'\boldsymbol{\beta}$ and $\sigma^2$ 382
14.3.1 Solving the Normal Equations and Estimating $\boldsymbol{\lambda}'\boldsymbol{\beta}$ 382
14.3.2 An Estimator for $\sigma^2$ 384
14.4 Testing Hypotheses 385
14.4.1 Test for Interaction 385
14.4.2 Tests for Main Effects 395
14.5 Expected Mean Squares 403
14.5.1 Sums-of-Squares Approach 403
14.5.2 Quadratic Form Approach 405

15 Analysis-of-Variance: The Cell Means Model for Unbalanced Data 413

15.1 Introduction 413
15.2 One-Way Model 415
15.2.1 Estimation and Testing 415
15.2.2 Contrasts 417
15.3 Two-Way Model 421
15.3.1 Unconstrained Model 421
15.3.2 Constrained Model 428
15.4 Two-Way Model with Empty Cells 432

16 Analysis-of-Covariance 443

16.1 Introduction 443
16.2 Estimation and Testing 444
16.2.1 The Analysis-of-Covariance Model 444
16.2.2 Estimation 446
16.2.3 Testing Hypotheses 448
16.3 One-Way Model with One Covariate 449
16.3.1 The Model 449
16.3.2 Estimation 449
16.3.3 Testing Hypotheses 450
16.4 Two-Way Model with One Covariate 457
16.4.1 Tests for Main Effects and Interactions 458
16.4.2 Test for Slope 462
16.4.3 Test for Homogeneity of Slopes 463
16.5 One-Way Model with Multiple Covariates 464
16.5.1 The Model 464
16.5.2 Estimation 465
16.5.3 Testing Hypotheses 468
16.6 Analysis-of-Covariance with Unbalanced Models 473


17 Linear Mixed Models 479

17.1 Introduction 479
17.2 The Linear Mixed Model 479
17.3 Examples 481
17.4 Estimation of Variance Components 486
17.5 Inference for $\boldsymbol{\beta}$ 490
17.5.1 An Estimator for $\boldsymbol{\beta}$ 490
17.5.2 Large-Sample Inference for Estimable Functions of $\boldsymbol{\beta}$ 491
17.5.3 Small-Sample Inference for Estimable Functions of $\boldsymbol{\beta}$ 491
17.6 Inference for the $\mathbf{a}_i$ Terms 497
17.7 Residual Diagnostics 501

18 Additional Models 507

18.1 Nonlinear Regression 507
18.2 Logistic Regression 508
18.3 Loglinear Models 511
18.4 Poisson Regression 512
18.5 Generalized Linear Models 513

Appendix A Answers and Hints to the Problems 517

References 653

Index 663


PREFACE 


In the second edition, we have added chapters on Bayesian inference in linear models 
(Chapter 11) and linear mixed models (Chapter 17), and have upgraded the material 
in all other chapters. Our continuing objective has been to introduce the theory of 
linear models in a clear but rigorous format. 

In spite of the availability of highly innovative tools in statistics, the main tool of 
the applied statistician remains the linear model. The linear model involves the sim- 
plest and seemingly most restrictive statistical properties: independence, normality, 
constancy of variance, and linearity. However, the model and the statistical 
methods associated with it are surprisingly versatile and robust. More importantly, 
mastery of the linear model is a prerequisite to work with advanced statistical tools 
because most advanced tools are generalizations of the linear model. The linear 
model is thus central to the training of any statistician, applied or theoretical. 

This book develops the basic theory of linear models for regression, analysis-of-variance,
analysis-of-covariance, and linear mixed models. Chapter 18 briefly introduces
logistic regression, generalized linear models, and nonlinear models.
Applications are illustrated by examples and problems using real data. This combination 
of theory and applications will prepare the reader to further explore the literature and to 
more correctly interpret the output from a linear models computer package. 

This introductory linear models book is designed primarily for a one-semester 
course for advanced undergraduates or MS students. It includes more material than 
can be covered in one semester so as to give an instructor a choice of topics and to 
serve as a reference book for researchers who wish to gain a better understanding 
of regression and analysis-of-variance. The book would also serve well as a text 
for PhD classes in which the instructor is looking for a one-semester introduction, 
and it would be a good supplementary text or reference for a more advanced PhD 
class for which the students need to review the basics on their own. 

Our overriding objective in the preparation of this book has been clarity of expo- 
sition. We hope that students, instructors, researchers, and practitioners will find this 
linear models text more comfortable than most. In the final stages of development, we 
asked students for written comments as they read each day’s assignment. They made 
many suggestions that led to improvements in readability of the book. We are grateful 
to readers who have notified us of errors and other suggestions for improvements of 
the text, and we will continue to be very grateful to readers who take the time to do so 
for this second edition. 




Another objective of the book is to tie up loose ends. There are many approaches 
to teaching regression, for example. Some books present estimation of regression 
coefficients for fixed x’s only, other books use random x’s, some use centered 
models, and others define estimated regression coefficients in terms of variances 
and covariances or in terms of correlations. Theory for linear models has been pre- 
sented using both an algebraic and a geometric approach. Many books present clas- 
sical (frequentist) inference for linear models, while increasingly the Bayesian 
approach is presented. We have tried to cover all these approaches carefully and to 
show how they relate to each other. We have attempted to do something similar 
for various approaches to analysis-of-variance. We believe that this will make the 
book useful as a reference as well as a textbook. An instructor can choose the 
approach he or she prefers, and a student or researcher has access to other methods 
as well. 

The book includes a large number of theoretical problems and a smaller number of 
applied problems using real datasets. The problems, along with the extensive set of 
answers in Appendix A, extend the book in two significant ways: (1) the theoretical 
problems and answers fill in nearly all gaps in derivations and proofs and also extend 
the coverage of material in the text, and (2) the applied problems and answers become 
additional examples illustrating the theory. As instructors, we find that having 
answers available for the students saves a great deal of class time and enables us to 
cover more material and cover it better. The answers would be especially useful to 
a reader who is engaging this material outside the formal classroom setting. 

The mathematical prerequisites for this book are multivariable calculus and matrix 
algebra. The review of matrix algebra in Chapter 2 is intended to be sufficiently com- 
plete so that the reader with no previous experience can master matrix manipulation 
up to the level required in this book. Statistical prerequisites include some exposure to 
statistical theory, with coverage of topics such as distributions of random variables, 
expected values, moment generating functions, and an introduction to estimation 
and testing hypotheses. These topics are briefly reviewed as each is introduced. 
One or two statistical methods courses would also be helpful, with coverage of 
topics such as t tests, regression, and analysis-of-variance.

We have made considerable effort to maintain consistency of notation throughout 
the book. We have also attempted to employ standard notation as far as possible and 
to avoid exotic characters that cannot be readily reproduced on the chalkboard. With a 
few exceptions, we have refrained from the use of abbreviations and mnemonic 
devices. We often find these annoying in a book or journal article. 

Equations are numbered sequentially throughout each chapter; for example, (3.29) 
indicates the twenty-ninth numbered equation in Chapter 3. Tables and figures are 
also numbered sequentially throughout each chapter in the form “Table 7.4” or 
“Figure 3.2.” On the other hand, examples and theorems are numbered sequentially 
within a section, for example, Theorems 2.2a and 2.2b. 

The solution of most of the problems with real datasets requires the use of the com- 
puter. We have not discussed command files or output of any particular program, 
because there are so many good packages available. Computations for the numerical 
examples and numerical problems were done with SAS. The datasets and SAS
command files for all the numerical examples and problems in the text are available
on the Internet; see Appendix B. 

The references list is not intended to be an exhaustive survey of the literature. We 
have provided original references for some of the basic results in linear models and 
have also referred the reader to many up-to-date texts and reference books useful for 
further reading. When citing references in the text, we have used the standard format 
involving the year of publication. For journal articles, the year alone suffices, for 
example, Fisher (1921). But for a specific reference in a book, we have included a 
page number or section, as in Hocking (1996, p. 216). 

Our selection of topics is intended to prepare the reader for a better understanding 
of applications and for further reading in topics such as mixed models, generalized 
linear models, and Bayesian models. Following a brief introduction in Chapter 1, 
Chapter 2 contains a careful review of all aspects of matrix algebra needed to read 
the book. Chapters 3, 4, and 5 cover properties of random vectors, matrices, and 
quadratic forms. Chapters 6, 7, and 8 cover simple and multiple linear regression, 
including estimation and testing hypotheses and consequences of misspecification 
of the model. Chapter 9 provides diagnostics for model validation and detection of 
influential observations. Chapter 10 treats multiple regression with random x’s. 
Chapter 11 covers Bayesian multiple linear regression models along with Bayesian 
inferences based on those models. Chapter 12 covers the basic theory of analysis- 
of-variance models, including estimability and testability for the overparameterized 
model, reparameterization, and the imposition of side conditions. Chapters 13 and 
14 cover balanced one-way and two-way analysis-of-variance models using an over- 
parameterized model. Chapter 15 covers unbalanced analysis-of-variance models 
using a cell means model, including a section on dealing with empty cells in
two-way analysis-of-variance. Chapter 16 covers analysis-of-covariance models.
Chapter 17 covers the basic theory of linear mixed models, including residual 
maximum likelihood estimation of variance components, approximate small- 
sample inferences for fixed effects, best linear unbiased prediction of random 
effects, and residual analysis. Chapter 18 introduces additional topics such as 
nonlinear regression, logistic regression, loglinear models, Poisson regression, and 
generalized linear models. 

In our class for first-year master’s-level students, we cover most of the material in 
Chapters 2-5, 7-8, 10-12, and 17. Many other sequences are possible. For example,
a thorough one-semester regression and analysis-of-variance course could cover 
Chapters 1-10, and 12-15. 

Al’s introduction to linear models came in classes taught by Dale Richards and 
Rolf Bargmann. He also learned much from the books by Graybill, Scheffé, and 
Rao. Al expresses thanks to the following for reading the first edition manuscript 
and making many valuable suggestions: David Turner, John Walker, Joel 
Reynolds, and Gale Rex Bryce. Al thanks the following students at Brigham 
Young University (BYU) who helped with computations, graphics, and typing of 
the first edition: David Fillmore, Candace Baker, Scott Curtis, Douglas Burton, 
David Dahl, Brenda Price, Eric Hintze, James Liechty, and Joy Willbur. The students
in Al’s Linear Models class went through the manuscript carefully and spotted many
typographical errors and passages that needed additional clarification. 

Bruce’s education in linear models came in classes taught by Mel Carter, Del 
Scott, Doug Martin, Peter Bloomfield, and Francis Giesbrecht, and influential short 
courses taught by John Nelder and Russ Wolfinger. 

We thank Bruce’s Linear Models classes of 2006 and 2007 for going through the 
book and new chapters. They made valuable suggestions for improvement of the text. 
We thank Paul Martin and James Hattaway for invaluable help with LaTeX. The
Department of Statistics, Brigham Young University provided financial support 
and encouragement throughout the project. 


Second Edition 


For the second edition we added Chapter 11 on Bayesian inference in linear models 
(including Gibbs sampling) and Chapter 17 on linear mixed models. 

We also added a section in Chapter 2 on vector and matrix calculus, adding several 
new theorems and covering the Lagrange multiplier method. In Chapter 4, we pre- 
sented a new proof of the conditional distribution of a subvector of a multivariate 
normal vector. In Chapter 5, we provided proofs of the moment generating function 
and variance of a quadratic form of a multivariate normal vector. The section on the 
geometry of least squares was completely rewritten in Chapter 7, and a section on the 
geometry of least squares in the overparameterized linear model was added to 
Chapter 12. Chapter 8 was revised to provide more motivation for hypothesis 
testing and simultaneous inference. A new section was added to Chapter 15 
dealing with two-way analysis-of-variance when there are empty cells. This material 
is not available in any other textbook that we are aware of. 

This book would not have been possible without the patience, support, and 
encouragement of Al’s wife LaRue and Bruce’s wife Lois. Both have helped and sup- 
ported us in more ways than they know. This book is dedicated to them. 


ALVIN C. RENCHER AND G. BRUCE SCHAALJE 


Department of Statistics 
Brigham Young University 
Provo, Utah 


1 Introduction 


The scientific method is frequently used as a guided approach to learning. Linear 
statistical methods are widely used as part of this learning process. In the biological, 
physical, and social sciences, as well as in business and engineering, linear models 
are useful in both the planning stages of research and analysis of the resulting data. 
In Sections 1.1-1.3, we give a brief introduction to simple and multiple linear 
regression models, and analysis-of-variance (ANOVA) models. 


1.1 SIMPLE LINEAR REGRESSION MODEL 


In simple linear regression, we attempt to model the relationship between two vari- 
ables, for example, income and number of years of education, height and weight 
of people, length and width of envelopes, temperature and output of an industrial 
process, altitude and boiling point of water, or dose of a drug and response. For a 
linear relationship, we can use a model of the form 


$$y = \beta_0 + \beta_1 x + \varepsilon, \tag{1.1}$$


where y is the dependent or response variable and x is the independent or predictor 
variable. The random variable $\varepsilon$ is the error term in the model. In this context, error
does not mean mistake but is a statistical term representing random fluctuations, 
measurement errors, or the effect of factors outside of our control. 

The linearity of the model in (1.1) is an assumption. We typically add other 
assumptions about the distribution of the error terms, independence of the observed 
values of y, and so on. Using observed values of x and y, we estimate $\beta_0$ and $\beta_1$ and
make inferences such as confidence intervals and tests of hypotheses for $\beta_0$ and $\beta_1$.
We may also use the estimated model to forecast or predict the value of y for a 
particular value of x, in which case a measure of predictive accuracy may also be 
of interest. 
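
As a minimal sketch of the estimation and prediction steps just described, the following Python code fits the line in (1.1) by least squares, which is assumed here as the estimation method. The data values and the function names `fit_simple_linear` and `predict` are hypothetical and serve only as an illustration.

```python
# Minimal sketch: least-squares fit of y = b0 + b1*x + error.
# The data values below are made up for illustration only.

def fit_simple_linear(x, y):
    """Return least-squares estimates (b0, b1) for y = b0 + b1*x + error."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope: sum of cross-products divided by the sum of squares of x.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    # Intercept: the fitted line passes through (x_bar, y_bar).
    b0 = y_bar - b1 * x_bar
    return b0, b1


def predict(b0, b1, x_new):
    """Predicted value of y at x_new from the estimated line."""
    return b0 + b1 * x_new


# Hypothetical data: years of education (x) and income in $1000s (y).
x = [10, 12, 14, 16, 18]
y = [25.0, 31.0, 38.0, 42.0, 50.0]

b0, b1 = fit_simple_linear(x, y)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
print(f"predicted y at x = 15: {predict(b0, b1, 15):.3f}")
```

The sketch returns point estimates only; confidence intervals and tests additionally require assumptions about the distribution of the error term.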

Estimation and inferential procedures for the simple linear regression model are 
developed and illustrated in Chapter 6. 




1.2 MULTIPLE LINEAR REGRESSION MODEL 


The response y is often influenced by more than one predictor variable. For example, 
the yield of a crop may depend on the amount of nitrogen, potash, and phosphate fer- 
tilizers used. These variables are controlled by the experimenter, but the yield may 
also depend on uncontrollable variables such as those associated with weather. 

A linear model relating the response y to several predictors has the form 


$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon. \tag{1.2}$$


The parameters $\beta_0, \beta_1, \ldots, \beta_k$ are called regression coefficients. As in (1.1), $\varepsilon$
provides for random variation in y not explained by the x variables. This random 
variation may be due partly to other variables that affect y but are not known or 
not observed. 

The model in (1.2) is linear in the $\beta$ parameters; it is not necessarily linear in the x
variables. Thus models such as 


$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \beta_3 x_2 + \beta_4 \sin x_2 + \varepsilon$$


are included in the designation linear model. 
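
To see concretely what "linear in the $\beta$ parameters" means, the following minimal Python sketch takes a model with an intercept and the regressors $x_1$, $x_1^2$, $x_2$, and $\sin x_2$ (an illustrative choice of regressors; the coefficient values and function names are hypothetical). It checks numerically that the mean response obeys superposition in the coefficients, and that it does not scale linearly in $x_1$.

```python
import math


def regressors(x1, x2):
    """Map raw predictors to the regressor values of the illustrative model.

    The leading 1.0 multiplies the intercept beta_0.
    """
    return [1.0, x1, x1 ** 2, x2, math.sin(x2)]


def mean_response(beta, x1, x2):
    """beta_0 + beta_1*x1 + beta_2*x1^2 + beta_3*x2 + beta_4*sin(x2)."""
    return sum(b * z for b, z in zip(beta, regressors(x1, x2)))


# Two arbitrary (hypothetical) coefficient vectors and scalar weights.
beta_a = [1.0, 0.5, -0.2, 2.0, 3.0]
beta_b = [0.0, 1.5, 0.4, -1.0, 0.5]
c, d = 2.0, -3.0
combo = [c * a + d * b for a, b in zip(beta_a, beta_b)]

x1, x2 = 1.7, 0.9

# Linearity in beta: the response at c*beta_a + d*beta_b equals the same
# weighted combination of the two individual responses.
lhs = mean_response(combo, x1, x2)
rhs = c * mean_response(beta_a, x1, x2) + d * mean_response(beta_b, x1, x2)
print("difference under superposition in beta:", abs(lhs - rhs))

# Nonlinearity in x1: doubling x1 does not double the response.
doubled = mean_response(beta_a, 2 * x1, x2)
twice = 2 * mean_response(beta_a, x1, x2)
print("response at 2*x1:", doubled, " twice the response at x1:", twice)
```

Because the nonlinearity lives entirely in `regressors`, which depends only on observed quantities, the transformed values can be treated as ordinary predictor columns when the coefficients are estimated.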

A model provides a theoretical framework for better understanding of a pheno- 
menon of interest. Thus a model is a mathematical construct that we believe may 
represent the mechanism that generated the observations at hand. The postulated 
model may be an idealized oversimplification of the complex real-world situation, 
but in many such cases, empirical models provide useful approximations of the 
relationships among variables. These relationships may be either associative or 
causative. 

Regression models such as (1.2) are used for various purposes, including the 
following: 


1. Prediction. Estimates of the individual parameters $\beta_0, \beta_1, \ldots, \beta_k$ are of less
importance for prediction than the overall influence of the x variables on y. 
However, good estimates are needed to achieve good prediction performance. 

2. Data Description or Explanation. The scientist or engineer uses the estimated 
model to summarize or describe the observed data. 

3. Parameter Estimation. The values of the estimated parameters may have 
theoretical implications for a postulated model. 


4. Variable Selection or Screening. The emphasis is on determining the import- 
ance of each predictor variable in modeling the variation in y. The predictors 
that are associated with an important amount of variation in y are retained; 
those that contribute little are deleted. 

5. Control of Output. A cause-and-effect relationship between y and the x 
variables is assumed. The estimated model might then be used to control the
output of a process by varying the inputs. By systematic experimentation, it
may be possible to achieve the optimal output. 


There is a fundamental difference between purposes 1 and 5. For prediction, we need
only assume that the same correlations that prevailed when the data were collected 
also continue in place when the predictions are to be made. Showing that there is a 
significant relationship between y and the x variables in (1.2) does not necessarily 
prove that the relationship is causal. To establish causality in order to control 
output, the researcher must choose the values of the x variables in the model and 
use randomization to avoid the effects of other possible variables unaccounted for. 
In other words, to ascertain the effect of the x variables on y when the x variables 
are changed, it is necessary to change them. 

Estimation and inferential procedures that contribute to the five purposes listed 
above are discussed in Chapters 7-11. 


1.3 ANALYSIS-OF-VARIANCE MODELS


In analysis-of-variance (ANOVA) models, we are interested in comparing several 
populations or several conditions in a study. Analysis-of-variance models can be 
expressed as linear models with restrictions on the x values. Typically the x’s are 0s
or 1s. For example, suppose that a researcher wishes to compare the mean yield for
four types of catalyst in an industrial process. If n observations are to be obtained for 
each catalyst, one model for the 4n observations can be expressed as 


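$$ y_{ij} = \mu_i + \varepsilon_{ij}, \quad i = 1, 2, 3, 4, \quad j = 1, 2, \ldots, n, \tag{1.3} $$


where $\mu_i$ is the mean corresponding to the ith catalyst. A hypothesis of interest is
$H_0\colon \mu_1 = \mu_2 = \mu_3 = \mu_4$. The model in (1.3) can be expressed in the alternative form


$$ y_{ij} = \mu + \alpha_i + \varepsilon_{ij}, \quad i = 1, 2, 3, 4, \quad j = 1, 2, \ldots, n. \tag{1.4} $$


In this form, $\alpha_i$ is the effect of the ith catalyst, and the hypothesis can be expressed as
$H_0\colon \alpha_1 = \alpha_2 = \alpha_3 = \alpha_4$.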
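
As a minimal sketch of how model (1.3) becomes a linear model whose x's are 0s and 1s, the following Python snippet builds the indicator matrix for four catalysts. The function name `cell_means_design` and the choice n = 2 are illustrative assumptions, not part of the text.

```python
def cell_means_design(groups, n):
    """Return the (groups*n) x groups matrix of 0s and 1s for model (1.3).

    Row r has a 1 in column i when observation r belongs to group i.
    """
    rows = []
    for i in range(groups):
        for _ in range(n):
            rows.append([1 if k == i else 0 for k in range(groups)])
    return rows


# Four catalysts, n = 2 observations each (hypothetical sizes).
X = cell_means_design(groups=4, n=2)
for row in X:
    print(row)
```

Each row of `X` selects exactly one of the four means, so multiplying `X` by the vector of means reproduces the mean of every observation.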

Suppose that the researcher also wishes to compare the effects of three levels of 
temperature and that n observations are taken at each of the 12 catalyst—temperature 
combinations. Then the model can be expressed as 


$$ y_{ijk} = \mu_{ij} + \varepsilon_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \varepsilon_{ijk}, \tag{1.5} $$
$$ i = 1, 2, 3, 4, \quad j = 1, 2, 3, \quad k = 1, 2, \ldots, n, $$


where $\mu_{ij}$ is the mean for the ijth catalyst-temperature combination, $\alpha_i$ is the effect of
the ith catalyst, $\beta_j$ is the effect of the jth level of temperature, and $\gamma_{ij}$ is the interaction
or joint effect of the ith catalyst and jth level of temperature.




In the examples leading to models (1.3)—(1.5), the researcher chooses the type of 
catalyst or level of temperature and thus applies different treatments to the objects or 
experimental units under study. In other settings, we compare the means of variables 
measured on natural groupings of units, for example, males and females or various 
geographic areas. 

Analysis-of-variance models can be treated as a special case of regression models, 
but it is more convenient to analyze them separately. This is done in Chapters 12-15. 
Related topics, such as analysis-of-covariance and mixed models, are covered in 
Chapters 16-17. 


2 Matrix Algebra


If we write a linear model such as (1.2) for each of n observations in a dataset, the n 
resulting models can be expressed in a single compact matrix expression. Then the 
estimation and testing results can be more easily obtained using matrix theory. 

In the present chapter, we review the elements of matrix theory needed in the 
remainder of the book. Proofs that seem instructive are included or called for in 
the problems. For other proofs, see Graybill (1969), Searle (1982), Harville (1997), 
Schott (1997), or any general text on matrix theory. We begin with some basic defi- 
nitions in Section 2.1. 


2.1 MATRIX AND VECTOR NOTATION 


2.1.1. Matrices, Vectors, and Scalars 


A matrix is a rectangular or square array of numbers or variables. We use uppercase 
boldface letters to represent matrices. In this book, all elements of matrices will be 
real numbers or variables representing real numbers. For example, the height (in 
inches) and weight (in pounds) for three students are listed in the following matrix: 


$$ \mathbf{A} = \begin{pmatrix} 65 & 154 \\ 73 & 182 \\ 68 & 167 \end{pmatrix}. \tag{2.1} $$


To represent the elements of A as variables, we use 


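$$ \mathbf{A} = (a_{ij}) = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \\ a_{31} & a_{32} \end{pmatrix}. \tag{2.2} $$


The first subscript in $a_{ij}$ indicates the row; the second identifies the column. The notation
$\mathbf{A} = (a_{ij})$ represents a matrix by means of a typical element.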




The matrix A in (2.1) or (2.2) has three rows and two columns, and we say that A is 
3 x 2, or that the size of A is 3 x 2. 

A vector is a matrix with a single row or column. Elements in a vector are often 
identified by a single subscript; for example


$$ \mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}. $$


As a convention, we use lowercase boldface letters for column vectors and lowercase
boldface letters followed by the prime symbol (′) for row vectors; for example


$$ \mathbf{x}' = (x_1, x_2, x_3) = (x_1 \;\; x_2 \;\; x_3). $$


(Row vectors are regarded as transposes of column vectors. The transpose is defined 
in Section 2.1.3 below). We use either commas or spaces to separate elements of a 
row vector. 

Geometrically, a row or column vector with p elements can be associated with a 
point in a p-dimensional space. The elements in the vector are the coordinates of the 
point. Sometimes we are interested in the distance from the origin to the point 
(vector), the distance between two points (vectors), or the angle between the 
arrows drawn from the origin to the two points. 

In the context of matrices and vectors, a single real number is called a scalar. Thus 
2.5, —9, and 7.26 are scalars. A variable representing a scalar will be denoted by a 
lightface letter (usually lowercase), such as c. A scalar is technically distinct from a 
1 x 1 matrix in terms of its uses and properties in matrix algebra. The same notation 
is often used to represent a scalar and a 1 x 1 matrix, but the meaning is usually 
obvious from the context. 


2.1.2 Matrix Equality 


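Two matrices or two vectors are equal if they are of the same size and if the elements
in corresponding positions are equal; that is, $\mathbf{A} = \mathbf{B}$ if and only if $a_{ij} = b_{ij}$ for
every i and j. Two matrices of different sizes, or two matrices of the same size that
differ in even one position, are not equal.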


2.1.3 Transpose 


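If we interchange the rows and columns of a matrix A, the resulting matrix is known
as the transpose of A and is denoted by A′; for example, for a general 2 × 3 matrix,


$$ \mathbf{A} = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \end{pmatrix}, \qquad \mathbf{A}' = \begin{pmatrix} a_{11} & a_{21} \\ a_{12} & a_{22} \\ a_{13} & a_{23} \end{pmatrix}. $$

Formally, if A is denoted by $\mathbf{A} = (a_{ij})$, then A′ is defined as

$$ \mathbf{A}' = (a_{ij})' = (a_{ji}). \tag{2.3} $$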
This notation indicates that the element in the ith row and jth column of A is found in 


the jth row and ith column of A’. If the matrix A is n x p, then A’ is p x n. 
If a matrix is transposed twice, the result is the original matrix. 


Theorem 2.1. If A is any matrix, then 


$$ (\mathbf{A}')' = \mathbf{A}. \tag{2.4} $$


PROOF. By (2.3), $\mathbf{A}' = (a_{ij})' = (a_{ji})$. Then $(\mathbf{A}')' = (a_{ji})' = (a_{ij}) = \mathbf{A}$. □
(The notation □ is used to indicate the end of a theorem proof, corollary proof, or
example.)
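
To see (2.3) and Theorem 2.1 at work, here is a small Python sketch that transposes the height and weight matrix from (2.1) and then transposes it back. The helper name `transpose` is an illustrative choice.

```python
def transpose(A):
    """Swap rows and columns: element (i, j) of A becomes element (j, i)."""
    return [list(col) for col in zip(*A)]


# The 3 x 2 matrix from (2.1): heights and weights of three students.
A = [[65, 154],
     [73, 182],
     [68, 167]]

At = transpose(A)     # 2 x 3
Att = transpose(At)   # transposing twice recovers A (Theorem 2.1)
print(At)
print(Att == A)
```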


2.1.4 Matrices of Special Form 


If the transpose of a matrix A is the same as the original matrix, that is, if A′ = A or
equivalently $(a_{ji}) = (a_{ij})$, then the matrix A is said to be symmetric. For example


$$ \mathbf{A} = \begin{pmatrix} 3 & 2 & 6 \\ 2 & 10 & -7 \\ 6 & -7 & 9 \end{pmatrix} $$


is symmetric. Clearly, all symmetric matrices are square. 


The diagonal of a p × p square matrix $\mathbf{A} = (a_{ij})$ consists of the elements
$a_{11}, a_{22}, \ldots, a_{pp}$. If a matrix contains zeros in all off-diagonal positions, it is said
to be a diagonal matrix; for example, consider the matrix


$$ \mathbf{D} = \begin{pmatrix} 8 & 0 & 0 & 0 \\ 0 & -3 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 4 \end{pmatrix}, $$


which can also be denoted as


$$ \mathbf{D} = \operatorname{diag}(8, -3, 0, 4). $$


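We also use the notation diag(A) to indicate a diagonal matrix with the same diagonal
elements as A; for example


$$ \mathbf{A} = \begin{pmatrix} 3 & 2 & 6 \\ 2 & 10 & -7 \\ 6 & -7 & 9 \end{pmatrix}, \qquad \operatorname{diag}(\mathbf{A}) = \begin{pmatrix} 3 & 0 & 0 \\ 0 & 10 & 0 \\ 0 & 0 & 9 \end{pmatrix}. $$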


A diagonal matrix with a 1 in each diagonal position is called an identity matrix,
and is denoted by I; for example


$$ \mathbf{I} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}. \tag{2.5} $$


An upper triangular matrix is a square matrix with zeros below the diagonal; for
example,


$$ \mathbf{T} = \begin{pmatrix} 7 & 2 & 3 & -5 \\ 0 & 0 & -2 & 6 \\ 0 & 0 & 4 & 1 \\ 0 & 0 & 0 & 8 \end{pmatrix}. $$

A lower triangular matrix is defined similarly. 
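A vector of 1s is denoted by j:

$$ \mathbf{j} = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}. \tag{2.6} $$


A square matrix of 1s is denoted by J; for example


$$ \mathbf{J} = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix}. \tag{2.7} $$


We denote a vector of zeros by 0 and a matrix of zeros by O; for example


$$ \mathbf{0} = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}, \qquad \mathbf{O} = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}. \tag{2.8} $$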


2.2 OPERATIONS 


We now define sums and products of matrices and vectors and consider some pro- 
perties of these sums and products. 


2.2.1 Sum of Two Matrices or Two Vectors 


If two matrices or two vectors are the same size, they are said to be conformal
for addition. Their sum is found by adding corresponding elements. Thus, if A is
n × p and B is n × p, then C = A + B is also n × p and is found as
$\mathbf{C} = (c_{ij}) = (a_{ij} + b_{ij})$; for example


$$ \begin{pmatrix} 7 & -3 & 4 \\ 2 & 8 & -5 \end{pmatrix} + \begin{pmatrix} 11 & 5 & -6 \\ 3 & 4 & 2 \end{pmatrix} = \begin{pmatrix} 18 & 2 & -2 \\ 5 & 12 & -3 \end{pmatrix}. $$


The difference D = A − B between two conformal matrices A and B is defined similarly:
$\mathbf{D} = (d_{ij}) = (a_{ij} - b_{ij})$.
Two properties of matrix addition are given in the following theorem. 


Theorem 2.2a. If A and B are both n x m, then 


(i) $\mathbf{A} + \mathbf{B} = \mathbf{B} + \mathbf{A}$. (2.9)
(ii) $(\mathbf{A} + \mathbf{B})' = \mathbf{A}' + \mathbf{B}'$. (2.10)




2.2.2. Product of a Scalar and a Matrix 


Any scalar can be multiplied by any matrix. The product of a scalar and a matrix is 
defined as the product of each element of the matrix and the scalar: 


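$$ c\mathbf{A} = (ca_{ij}) = \begin{pmatrix} ca_{11} & ca_{12} & \cdots & ca_{1m} \\ ca_{21} & ca_{22} & \cdots & ca_{2m} \\ \vdots & \vdots & & \vdots \\ ca_{n1} & ca_{n2} & \cdots & ca_{nm} \end{pmatrix}. \tag{2.11} $$


Since $ca_{ij} = a_{ij}c$, the product of a scalar and a matrix is commutative: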


cA = Ac. (2.12) 


2.2.3. Product of Two Matrices or Two Vectors 


In order for the product AB to be defined, the number of columns in A must equal the 
number of rows in B, in which case A and B are said to be conformal for multipli- 
cation. Then the (ij)th element of the product C = AB is defined as 


$$ c_{ij} = \sum_k a_{ik} b_{kj}, \tag{2.13} $$


which is the sum of products of the elements in the ith row of A and the elements in
the jth column of B. Thus we multiply every row of A by every column of B. If A is
n × m and B is m × p, then C = AB is n × p. We illustrate matrix multiplication in
the following example. 


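Example 2.2.3. Let


$$ \mathbf{A} = \begin{pmatrix} 2 & 1 & 3 \\ 4 & 6 & 5 \end{pmatrix}, \qquad \mathbf{B} = \begin{pmatrix} 1 & 4 \\ 2 & 6 \\ 3 & 8 \end{pmatrix}. $$

Then

$$ \mathbf{AB} = \begin{pmatrix} 2 \cdot 1 + 1 \cdot 2 + 3 \cdot 3 & 2 \cdot 4 + 1 \cdot 6 + 3 \cdot 8 \\ 4 \cdot 1 + 6 \cdot 2 + 5 \cdot 3 & 4 \cdot 4 + 6 \cdot 6 + 5 \cdot 8 \end{pmatrix} = \begin{pmatrix} 13 & 38 \\ 31 & 92 \end{pmatrix}, $$

$$ \mathbf{BA} = \begin{pmatrix} 18 & 25 & 23 \\ 28 & 38 & 36 \\ 38 & 51 & 49 \end{pmatrix}. $$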
38 51 49 


Note that a 1 x 1 matrix A can only be multiplied on the right by a 1 x n matrix B or 
on the left by an n x 1 matrix C, whereas a scalar can be multiplied on the right or 
left by a matrix of any size. 
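
The products in Example 2.2.3 can be checked with a direct implementation of (2.13). In this sketch, the entries of A and B are the ones implied by the sums of products shown in the example; the helper name `matmul` is an illustrative choice.

```python
def matmul(A, B):
    """Matrix product by (2.13): c_ij = sum over k of a_ik * b_kj."""
    if len(A[0]) != len(B):
        raise ValueError("A and B are not conformal for multiplication")
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


A = [[2, 1, 3],
     [4, 6, 5]]          # 2 x 3
B = [[1, 4],
     [2, 6],
     [3, 8]]             # 3 x 2

AB = matmul(A, B)        # 2 x 2
BA = matmul(B, A)        # 3 x 3, so AB and BA cannot be equal
print(AB)
print(BA)
```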


If A is n × m and B is m × p, where n ≠ p, then AB is defined, but BA is not
defined. If A is n × p and B is p × n, then AB is n × n and BA is p × p. In this
case, of course, AB ≠ BA, as illustrated in Example 2.2.3. If A and B are both
n × n, then AB and BA are the same size, but, in general


$$ \mathbf{AB} \ne \mathbf{BA}. \tag{2.14} $$


[There are a few exceptions to (2.14), for example, two diagonal matrices or a square 
matrix and an identity.] Thus matrix multiplication is not commutative, and certain 
familiar manipulations with real numbers cannot be done with matrices. However, 
matrix multiplication is distributive over addition or subtraction: 


A(B + C) = AB + AC, (2.15) 


(A + B)C = AC + BC. (2.16) 


Using (2.15) and (2.16), we can expand products such as (A — B)(C — D): 


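$$ \begin{aligned} (\mathbf{A} - \mathbf{B})(\mathbf{C} - \mathbf{D}) &= (\mathbf{A} - \mathbf{B})\mathbf{C} - (\mathbf{A} - \mathbf{B})\mathbf{D} && [\text{by (2.15)}] \\ &= \mathbf{AC} - \mathbf{BC} - \mathbf{AD} + \mathbf{BD} && [\text{by (2.16)}]. \end{aligned} \tag{2.17} $$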


Multiplication involving vectors follows the same rules as for matrices. Suppose
that A is n × p, b is p × 1, c is p × 1, and d is n × 1. Then Ab is a column vector
of size n × 1, d′A is a row vector of size 1 × p, b′c is a sum of products (1 × 1),
bc′ is a p × p matrix, and cd′ is a p × n matrix. Since b′c is a 1 × 1 sum of products,
it is equal to c′b:


$$ \begin{aligned} \mathbf{b}'\mathbf{c} &= b_1c_1 + b_2c_2 + \cdots + b_pc_p, \\ \mathbf{c}'\mathbf{b} &= c_1b_1 + c_2b_2 + \cdots + c_pb_p, \\ \mathbf{b}'\mathbf{c} &= \mathbf{c}'\mathbf{b}. \end{aligned} \tag{2.18} $$


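The matrix cd′ is given by


$$ \mathbf{cd}' = \begin{pmatrix} c_1d_1 & c_1d_2 & \cdots & c_1d_n \\ c_2d_1 & c_2d_2 & \cdots & c_2d_n \\ \vdots & \vdots & & \vdots \\ c_pd_1 & c_pd_2 & \cdots & c_pd_n \end{pmatrix}. \tag{2.19} $$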


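Similarly

$$ \mathbf{b}'\mathbf{b} = b_1^2 + b_2^2 + \cdots + b_p^2, \tag{2.20} $$

$$ \mathbf{bb}' = \begin{pmatrix} b_1^2 & b_1b_2 & \cdots & b_1b_p \\ b_2b_1 & b_2^2 & \cdots & b_2b_p \\ \vdots & \vdots & & \vdots \\ b_pb_1 & b_pb_2 & \cdots & b_p^2 \end{pmatrix}. \tag{2.21} $$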


Thus, b’b is a sum of squares and bb’ is a (symmetric) square matrix. 
The square root of the sum of squares of the elements of a p × 1 vector b is the
distance from the origin to the point b and is also referred to as the length of b:


$$ \text{Length of } \mathbf{b} = \sqrt{\mathbf{b}'\mathbf{b}} = \sqrt{\sum_{i=1}^{p} b_i^2}. \tag{2.22} $$


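If j is an n × 1 vector of 1s as defined in (2.6), then by (2.20) and (2.21), we have

$$ \mathbf{j}'\mathbf{j} = n, \qquad \mathbf{jj}' = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ 1 & 1 & \cdots & 1 \\ \vdots & \vdots & & \vdots \\ 1 & 1 & \cdots & 1 \end{pmatrix} = \mathbf{J}, \tag{2.23} $$


where J is an n × n square matrix of 1s as illustrated in (2.7). If a is n × 1 and A is
n × p, then


$$ \mathbf{a}'\mathbf{j} = \mathbf{j}'\mathbf{a} = \sum_{i=1}^{n} a_i, \tag{2.24} $$


$$ \mathbf{j}'\mathbf{A} = \Bigl( \sum_i a_{i1}, \; \sum_i a_{i2}, \; \ldots, \; \sum_i a_{ip} \Bigr), \qquad \mathbf{Aj} = \begin{pmatrix} \sum_j a_{1j} \\ \sum_j a_{2j} \\ \vdots \\ \sum_j a_{nj} \end{pmatrix}. \tag{2.25} $$


Thus a′j is the sum of the elements in a, j′A contains the column sums of A, and Aj
contains the row sums of A. Note that in a′j, the vector j is n × 1; in j′A, the vector j
is n × 1; and in Aj, the vector j is p × 1.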
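
The column-sum and row-sum identities in (2.25) can be illustrated with the matrix from (2.1). This is a minimal sketch; the variable names are illustrative, and the two vectors of 1s are given different lengths to match the note above.

```python
A = [[65, 154],
     [73, 182],
     [68, 167]]
n, p = len(A), len(A[0])

j_n = [1] * n   # the n x 1 vector of 1s used in j'A
j_p = [1] * p   # the p x 1 vector of 1s used in Aj

# j'A: one entry per column of A (column sums).
col_sums = [sum(j_n[i] * A[i][k] for i in range(n)) for k in range(p)]

# Aj: one entry per row of A (row sums).
row_sums = [sum(A[i][k] * j_p[k] for k in range(p)) for i in range(n)]

print(col_sums)
print(row_sums)
```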




The transpose of the product of two matrices is the product of the transposes in 
reverse order. 


Theorem 2.2b. If A is n × p and B is p × m, then

$$ (\mathbf{AB})' = \mathbf{B}'\mathbf{A}'. \tag{2.26} $$


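PROOF. Let C = AB. Then by (2.13)


$$ \mathbf{C} = (c_{ij}) = \Bigl( \sum_{k=1}^{p} a_{ik} b_{kj} \Bigr). $$

By (2.3), the transpose of C = AB becomes

$$ (\mathbf{AB})' = \mathbf{C}' = (c_{ij})' = (c_{ji}) = \Bigl( \sum_{k=1}^{p} a_{jk} b_{ki} \Bigr) = \Bigl( \sum_{k=1}^{p} b_{ki} a_{jk} \Bigr) = \mathbf{B}'\mathbf{A}'. \quad \square $$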


We illustrate the steps in the proof of Theorem 2.2b using a 2 x 3 matrix A and a 
3 x 2 matrix B: 


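$$ \mathbf{AB} = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \end{pmatrix} \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \\ b_{31} & b_{32} \end{pmatrix} = \begin{pmatrix} a_{11}b_{11} + a_{12}b_{21} + a_{13}b_{31} & a_{11}b_{12} + a_{12}b_{22} + a_{13}b_{32} \\ a_{21}b_{11} + a_{22}b_{21} + a_{23}b_{31} & a_{21}b_{12} + a_{22}b_{22} + a_{23}b_{32} \end{pmatrix}, $$


$$ \begin{aligned} (\mathbf{AB})' &= \begin{pmatrix} a_{11}b_{11} + a_{12}b_{21} + a_{13}b_{31} & a_{21}b_{11} + a_{22}b_{21} + a_{23}b_{31} \\ a_{11}b_{12} + a_{12}b_{22} + a_{13}b_{32} & a_{21}b_{12} + a_{22}b_{22} + a_{23}b_{32} \end{pmatrix} \\ &= \begin{pmatrix} b_{11}a_{11} + b_{21}a_{12} + b_{31}a_{13} & b_{11}a_{21} + b_{21}a_{22} + b_{31}a_{23} \\ b_{12}a_{11} + b_{22}a_{12} + b_{32}a_{13} & b_{12}a_{21} + b_{22}a_{22} + b_{32}a_{23} \end{pmatrix} \\ &= \begin{pmatrix} b_{11} & b_{21} & b_{31} \\ b_{12} & b_{22} & b_{32} \end{pmatrix} \begin{pmatrix} a_{11} & a_{21} \\ a_{12} & a_{22} \\ a_{13} & a_{23} \end{pmatrix} = \mathbf{B}'\mathbf{A}'. \end{aligned} $$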




The following corollary to Theorem 2.2b gives the transpose of the product of 
three matrices. 


Corollary 1. If A, B, and C are conformal so that ABC is defined, then
$(\mathbf{ABC})' = \mathbf{C}'\mathbf{B}'\mathbf{A}'$.


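Suppose that A is n × m and B is m × p. Let $\mathbf{a}_i'$ be the ith row of A and $\mathbf{b}_j$ be the jth
column of B, so that


$$ \mathbf{A} = \begin{pmatrix} \mathbf{a}_1' \\ \mathbf{a}_2' \\ \vdots \\ \mathbf{a}_n' \end{pmatrix}, \qquad \mathbf{B} = (\mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_p). $$


Then, by definition, the (ij)th element of AB is $\mathbf{a}_i'\mathbf{b}_j$:


$$ \mathbf{AB} = \begin{pmatrix} \mathbf{a}_1'\mathbf{b}_1 & \mathbf{a}_1'\mathbf{b}_2 & \cdots & \mathbf{a}_1'\mathbf{b}_p \\ \mathbf{a}_2'\mathbf{b}_1 & \mathbf{a}_2'\mathbf{b}_2 & \cdots & \mathbf{a}_2'\mathbf{b}_p \\ \vdots & \vdots & & \vdots \\ \mathbf{a}_n'\mathbf{b}_1 & \mathbf{a}_n'\mathbf{b}_2 & \cdots & \mathbf{a}_n'\mathbf{b}_p \end{pmatrix}. $$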


This product can be written in terms of the rows of A: 


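$$ \mathbf{AB} = \begin{pmatrix} \mathbf{a}_1'(\mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_p) \\ \mathbf{a}_2'(\mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_p) \\ \vdots \\ \mathbf{a}_n'(\mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_p) \end{pmatrix} = \begin{pmatrix} \mathbf{a}_1'\mathbf{B} \\ \mathbf{a}_2'\mathbf{B} \\ \vdots \\ \mathbf{a}_n'\mathbf{B} \end{pmatrix} = \begin{pmatrix} \mathbf{a}_1' \\ \mathbf{a}_2' \\ \vdots \\ \mathbf{a}_n' \end{pmatrix} \mathbf{B}. \tag{2.27} $$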


The first column of AB can be expressed in terms of A as 


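$$ \begin{pmatrix} \mathbf{a}_1'\mathbf{b}_1 \\ \mathbf{a}_2'\mathbf{b}_1 \\ \vdots \\ \mathbf{a}_n'\mathbf{b}_1 \end{pmatrix} = \begin{pmatrix} \mathbf{a}_1' \\ \mathbf{a}_2' \\ \vdots \\ \mathbf{a}_n' \end{pmatrix} \mathbf{b}_1 = \mathbf{A}\mathbf{b}_1. $$


Likewise, the second column is $\mathbf{A}\mathbf{b}_2$, and so on. Thus AB can be written in terms of
the columns of B:


$$ \mathbf{AB} = \mathbf{A}(\mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_p) = (\mathbf{A}\mathbf{b}_1, \mathbf{A}\mathbf{b}_2, \ldots, \mathbf{A}\mathbf{b}_p). \tag{2.28} $$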


Any matrix A can be multiplied by its transpose to form A’A or AA’. Some pro- 
perties of these two products are given in the following theorem. 


Theorem 2.2c. Let A be any n x p matrix. Then A’A and AA’ have the following 
properties. 


(i) A’A is p x p and its elements are products of the columns of A. 
(ii) AA’ is n x n and its elements are products of the rows of A. 
(iii) Both A’A and AA’ are symmetric. 
(iv) If A′A = O, then A = O.


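Let A be an n × n matrix and let $\mathbf{D} = \operatorname{diag}(d_1, d_2, \ldots, d_n)$. In the product DA, the
ith row of A is multiplied by $d_i$, and in AD, the jth column of A is multiplied by $d_j$.
For example, if n = 3, we have


$$ \mathbf{DA} = \begin{pmatrix} d_1 & 0 & 0 \\ 0 & d_2 & 0 \\ 0 & 0 & d_3 \end{pmatrix} \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix} = \begin{pmatrix} d_1a_{11} & d_1a_{12} & d_1a_{13} \\ d_2a_{21} & d_2a_{22} & d_2a_{23} \\ d_3a_{31} & d_3a_{32} & d_3a_{33} \end{pmatrix}, \tag{2.29} $$

$$ \mathbf{AD} = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix} \begin{pmatrix} d_1 & 0 & 0 \\ 0 & d_2 & 0 \\ 0 & 0 & d_3 \end{pmatrix} = \begin{pmatrix} d_1a_{11} & d_2a_{12} & d_3a_{13} \\ d_1a_{21} & d_2a_{22} & d_3a_{23} \\ d_1a_{31} & d_2a_{32} & d_3a_{33} \end{pmatrix}, \tag{2.30} $$

$$ \mathbf{DAD} = \begin{pmatrix} d_1^2a_{11} & d_1d_2a_{12} & d_1d_3a_{13} \\ d_2d_1a_{21} & d_2^2a_{22} & d_2d_3a_{23} \\ d_3d_1a_{31} & d_3d_2a_{32} & d_3^2a_{33} \end{pmatrix}. \tag{2.31} $$


Note that DA ≠ AD. However, in the special case where the diagonal matrix is the
identity, (2.29) and (2.30) become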


IA = AI=A. (2.32) 


If A is rectangular, (2.32) still holds, but the two identities are of different sizes. 
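
The row and column scaling in (2.29) and (2.30) is easy to confirm numerically. The following sketch uses a hypothetical 2 × 2 matrix and diagonal entries chosen only for readability; `matmul` and `diag` are illustrative helper names.

```python
def matmul(A, B):
    """Matrix product: c_ij = sum over k of a_ik * b_kj."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


def diag(d):
    """Diagonal matrix with the entries of d on the diagonal."""
    n = len(d)
    return [[d[i] if i == j else 0 for j in range(n)] for i in range(n)]


A = [[1, 2],
     [3, 4]]
D = diag([10, 100])

DA = matmul(D, A)   # each row i of A is multiplied by d_i
AD = matmul(A, D)   # each column j of A is multiplied by d_j
I = diag([1, 1])

print(DA, AD)
print(matmul(I, A) == A and matmul(A, I) == A)
```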
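If A is a symmetric matrix and y is a vector, the product


$$ \mathbf{y}'\mathbf{A}\mathbf{y} = \sum_i a_{ii} y_i^2 + \sum_{i \ne j} a_{ij} y_i y_j \tag{2.33} $$


is called a quadratic form. If x is n × 1, y is p × 1, and A is n × p, the product


$$ \mathbf{x}'\mathbf{A}\mathbf{y} = \sum_{ij} a_{ij} x_i y_j \tag{2.34} $$

is called a bilinear form.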


2.2.4 Hadamard Product of Two Matrices or Two Vectors 


Sometimes a third type of product, called the elementwise or Hadamard product, 
is useful. If two matrices or two vectors are of the same size (conformal for addition), 
the Hadamard product is found by simply multiplying corresponding elements: 


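$$ (a_{ij} b_{ij}) = \begin{pmatrix} a_{11}b_{11} & a_{12}b_{12} & \cdots & a_{1p}b_{1p} \\ a_{21}b_{21} & a_{22}b_{22} & \cdots & a_{2p}b_{2p} \\ \vdots & \vdots & & \vdots \\ a_{n1}b_{n1} & a_{n2}b_{n2} & \cdots & a_{np}b_{np} \end{pmatrix}. $$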


2.3 PARTITIONED MATRICES 

It is sometimes convenient to partition a matrix into submatrices. For example, a
partitioning of a matrix A into four (square or rectangular) submatrices of appropriate
sizes can be indicated symbolically as follows:


$$ \mathbf{A} = \begin{pmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{A}_{22} \end{pmatrix}. $$


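To illustrate, let the 4 × 5 matrix A be partitioned as


$$ \mathbf{A} = \left( \begin{array}{rrr|rr} 7 & 2 & 5 & 8 & 4 \\ -3 & 4 & 0 & 2 & 7 \\ \hline 9 & 3 & 6 & 5 & -2 \\ 3 & 1 & 2 & 1 & 6 \end{array} \right) = \begin{pmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{A}_{22} \end{pmatrix}, $$


where


$$ \mathbf{A}_{11} = \begin{pmatrix} 7 & 2 & 5 \\ -3 & 4 & 0 \end{pmatrix}, \quad \mathbf{A}_{12} = \begin{pmatrix} 8 & 4 \\ 2 & 7 \end{pmatrix}, \quad \mathbf{A}_{21} = \begin{pmatrix} 9 & 3 & 6 \\ 3 & 1 & 2 \end{pmatrix}, \quad \mathbf{A}_{22} = \begin{pmatrix} 5 & -2 \\ 1 & 6 \end{pmatrix}. $$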


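If two matrices A and B are conformal for multiplication, and if A and B are partitioned
so that the submatrices are appropriately conformal, then the product AB can
be found using the usual pattern of row by column multiplication with the submatrices
as if they were single elements; for example


$$ \mathbf{AB} = \begin{pmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{A}_{22} \end{pmatrix} \begin{pmatrix} \mathbf{B}_{11} & \mathbf{B}_{12} \\ \mathbf{B}_{21} & \mathbf{B}_{22} \end{pmatrix} = \begin{pmatrix} \mathbf{A}_{11}\mathbf{B}_{11} + \mathbf{A}_{12}\mathbf{B}_{21} & \mathbf{A}_{11}\mathbf{B}_{12} + \mathbf{A}_{12}\mathbf{B}_{22} \\ \mathbf{A}_{21}\mathbf{B}_{11} + \mathbf{A}_{22}\mathbf{B}_{21} & \mathbf{A}_{21}\mathbf{B}_{12} + \mathbf{A}_{22}\mathbf{B}_{22} \end{pmatrix}. \tag{2.35} $$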


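If B is replaced by a vector b partitioned into two sets of elements, and if A is
correspondingly partitioned into two sets of columns, then (2.35) becomes


$$ \mathbf{Ab} = (\mathbf{A}_1, \mathbf{A}_2) \begin{pmatrix} \mathbf{b}_1 \\ \mathbf{b}_2 \end{pmatrix} = \mathbf{A}_1\mathbf{b}_1 + \mathbf{A}_2\mathbf{b}_2, \tag{2.36} $$


where the number of columns of $\mathbf{A}_1$ is equal to the number of elements of $\mathbf{b}_1$, and $\mathbf{A}_2$
and $\mathbf{b}_2$ are similarly conformal. Note that the partitioning in $\mathbf{A} = (\mathbf{A}_1, \mathbf{A}_2)$ is indicated
by a comma.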

The partitioned multiplication in (2.36) can be extended to individual columns of 
A and individual elements of b: 


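$$ \mathbf{Ab} = (\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_p) \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_p \end{pmatrix} = b_1\mathbf{a}_1 + b_2\mathbf{a}_2 + \cdots + b_p\mathbf{a}_p. \tag{2.37} $$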


Thus Ab is expressible as a linear combination of the columns of A, in which the 
coefficients are elements of b. We illustrate (2.37) in the following example. 


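Example 2.3. Let


$$ \mathbf{A} = \begin{pmatrix} 6 & -2 & 3 \\ 2 & 1 & 0 \\ 4 & 3 & 2 \end{pmatrix}, \qquad \mathbf{b} = \begin{pmatrix} 4 \\ 2 \\ -1 \end{pmatrix}. $$

Then

$$ \mathbf{Ab} = \begin{pmatrix} 17 \\ 10 \\ 20 \end{pmatrix}. $$


Using a linear combination of columns of A as in (2.37), we obtain


$$ \begin{aligned} \mathbf{Ab} &= b_1\mathbf{a}_1 + b_2\mathbf{a}_2 + b_3\mathbf{a}_3 \\ &= 4 \begin{pmatrix} 6 \\ 2 \\ 4 \end{pmatrix} + 2 \begin{pmatrix} -2 \\ 1 \\ 3 \end{pmatrix} - \begin{pmatrix} 3 \\ 0 \\ 2 \end{pmatrix} \\ &= \begin{pmatrix} 24 \\ 8 \\ 16 \end{pmatrix} + \begin{pmatrix} -4 \\ 2 \\ 6 \end{pmatrix} - \begin{pmatrix} 3 \\ 0 \\ 2 \end{pmatrix} = \begin{pmatrix} 17 \\ 10 \\ 20 \end{pmatrix}. \end{aligned} $$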


By (2.28) and (2.37), the columns of the product AB are linear combinations of the 
columns of A. The coefficients for the jth column of AB are the elements of the jth 
column of B. 
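
The equivalence between the row-by-column product and the linear combination of columns in (2.37) can be sketched in Python with the numbers from Example 2.3. The variable names are illustrative.

```python
A = [[6, -2, 3],
     [2, 1, 0],
     [4, 3, 2]]
b = [4, 2, -1]
n, p = len(A), len(b)

# Row-by-column product: element i of Ab is (row i of A) times b.
Ab = [sum(A[i][k] * b[k] for k in range(p)) for i in range(n)]

# Linear combination of columns: b_1 a_1 + b_2 a_2 + b_3 a_3, as in (2.37).
columns = [[A[i][k] for i in range(n)] for k in range(p)]
combo = [sum(b[k] * columns[k][i] for k in range(p)) for i in range(n)]

print(Ab)
print(combo)
```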

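The product of a row vector and a matrix, a′B, can be expressed as a linear combination
of the rows of B, in which the coefficients are elements of a′:


$$ \mathbf{a}'\mathbf{B} = (a_1, a_2, \ldots, a_n) \begin{pmatrix} \mathbf{b}_1' \\ \mathbf{b}_2' \\ \vdots \\ \mathbf{b}_n' \end{pmatrix} = a_1\mathbf{b}_1' + a_2\mathbf{b}_2' + \cdots + a_n\mathbf{b}_n'. \tag{2.38} $$


By (2.27) and (2.38), the rows of the matrix product AB are linear combinations
of the rows of B. The coefficients for the ith row of AB are the elements of the ith
row of A.

Finally, we note that if a matrix A is partitioned as $\mathbf{A} = (\mathbf{A}_1, \mathbf{A}_2)$, then


$$ \mathbf{A}' = (\mathbf{A}_1, \mathbf{A}_2)' = \begin{pmatrix} \mathbf{A}_1' \\ \mathbf{A}_2' \end{pmatrix}. \tag{2.39} $$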


2.4 RANK 


Before defining the rank of a matrix, we first introduce the notion of linear independence
and dependence. A set of vectors $\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_n$ is said to be linearly dependent
if scalars $c_1, c_2, \ldots, c_n$ (not all zero) can be found such that


$$ c_1\mathbf{a}_1 + c_2\mathbf{a}_2 + \cdots + c_n\mathbf{a}_n = \mathbf{0}. \tag{2.40} $$


If no coefficients $c_1, c_2, \ldots, c_n$ (not all zero) can be found that satisfy (2.40), the set of vectors
$\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_n$ is said to be linearly independent. By (2.37) this can be restated as
follows. The columns of A are linearly independent if Ac = 0 implies c = 0. (If a 
set of vectors includes 0, the set is linearly dependent.) If (2.40) holds, then at 
least one of the vectors a; can be expressed as a linear combination of the other 
vectors in the set. Among linearly independent vectors there is no redundancy of 
this type. 
The rank of any square or rectangular matrix A is defined as 


rank(A) = number of linearly independent columns of A 


= number of linearly independent rows of A. 


It can be shown that the number of linearly independent columns of any matrix is 
always equal to the number of linearly independent rows. 

If a matrix A has a single nonzero element, with all other elements equal to 0, then 
rank(A) = 1. The vector 0 and the matrix O have rank 0. 

Suppose that a rectangular matrix A is n × p of rank p, where p < n. (We typically
shorten this statement to “A is n × p of rank p < n.”) Then A has maximum possible
rank and is said to be of full rank. In general, the maximum possible rank of an n × p
matrix A is min(n, p). Thus, in a rectangular matrix, the rows or columns (or both) are 
linearly dependent. We illustrate this in the following example. 


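Example 2.4a. The rank of


$$ \mathbf{A} = \begin{pmatrix} 1 & -2 & 3 \\ 5 & 2 & 4 \end{pmatrix} $$


is 2 because the two rows are linearly independent (neither row is a multiple of the
other). Hence, by the definition of rank, the number of linearly independent columns
is also 2. Therefore, the columns are linearly dependent, and by (2.40) there exist
constants $c_1$, $c_2$, and $c_3$ such that


$$ c_1 \begin{pmatrix} 1 \\ 5 \end{pmatrix} + c_2 \begin{pmatrix} -2 \\ 2 \end{pmatrix} + c_3 \begin{pmatrix} 3 \\ 4 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}. \tag{2.41} $$


By (2.37), we can write (2.41) in the form


$$ \begin{pmatrix} 1 & -2 & 3 \\ 5 & 2 & 4 \end{pmatrix} \begin{pmatrix} c_1 \\ c_2 \\ c_3 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix} \quad \text{or} \quad \mathbf{Ac} = \mathbf{0}. \tag{2.42} $$


The solution to (2.42) is given by any multiple of $\mathbf{c} = (14, -11, -12)'$. In this case,
the product Ac is equal to 0, even though A ≠ O and c ≠ 0. This is possible because
of the linear dependence of the column vectors of A.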
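
The rank can be computed by row reduction. The sketch below uses exact fractions and a hypothetical helper `rank`. The matrix A is taken to be the 2 × 3 matrix consistent with the solution c = (14, −11, −12)′ quoted in Example 2.4a, so its entries should be read as an assumption.

```python
from fractions import Fraction


def rank(M):
    """Rank via Gaussian elimination with exact rational arithmetic."""
    M = [[Fraction(x) for x in row] for row in M]
    rows, cols = len(M), len(M[0])
    r = 0  # number of pivots found so far
    for c in range(cols):
        pivot = next((i for i in range(r, rows) if M[i][c] != 0), None)
        if pivot is None:
            continue  # no pivot in this column
        M[r], M[pivot] = M[pivot], M[r]
        for i in range(r + 1, rows):
            f = M[i][c] / M[r][c]
            M[i] = [a - f * b for a, b in zip(M[i], M[r])]
        r += 1
        if r == rows:
            break
    return r


A = [[1, -2, 3],
     [5, 2, 4]]
c = [14, -11, -12]

Ac = [sum(a * x for a, x in zip(row, c)) for row in A]
print(rank(A))  # number of linearly independent rows (and columns)
print(Ac)       # a nonzero c with Ac = 0 shows the columns are dependent
```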


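We can extend (2.42) to products of matrices. It is possible to find A ≠ O and
B ≠ O such that


$$ \mathbf{AB} = \mathbf{O}; \tag{2.43} $$


for example


$$ \begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix} \begin{pmatrix} 2 & 6 \\ -1 & -3 \end{pmatrix} = \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix}. $$


We can also exploit the linear dependence of rows or columns of a matrix to create
expressions such as AB = CB, where A ≠ C. Thus in a matrix equation, we cannot,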
in general, cancel a matrix from both sides of the equation. There are two exceptions 
to this rule: (1) if B is a full-rank square matrix, then AB = CB implies A = C; (2) 
the other special case occurs when the expression holds for all possible values of the 
matrix common to both sides of the equation; for example 


if Ax = Bx for all possible values of x, (2.44) 


then A = B. To see this, let x = (1,0,..., 0)’. Then, by (2.37) the first column of A 
equals the first column of B. Now let x = (0,1,0,...,0)', and the second column of 
A equals the second column of B. Continuing in this fashion, we obtain A = B. 


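Example 2.4b. We illustrate the existence of matrices A, B, and C such that
AB = CB, where A ≠ C. Let


$$ \mathbf{A} = \begin{pmatrix} 1 & 3 & 2 \\ 2 & 0 & -1 \end{pmatrix}, \qquad \mathbf{B} = \begin{pmatrix} 1 & 2 \\ 0 & 1 \\ 1 & 0 \end{pmatrix}, \qquad \mathbf{C} = \begin{pmatrix} 2 & 1 & 1 \\ 5 & -6 & -4 \end{pmatrix}. $$


Then


$$ \mathbf{AB} = \mathbf{CB} = \begin{pmatrix} 3 & 5 \\ 1 & 4 \end{pmatrix}. $$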

The following theorem gives a general case and two special cases for the rank of a 
product of two matrices. 


Theorem 2.4 


(i) If the matrices A and B are conformal for multiplication, then rank(AB) ≤
rank(A) and rank(AB) ≤ rank(B).


(ii) Multiplication by a full-rank square matrix does not change the rank; that is,
if B and C are full-rank square matrices, rank(AB) = rank(CA) = rank(A).


(iii) For any matrix A, rank(A′A) = rank(AA′) = rank(A′) = rank(A).

PROOF


(i) All the columns of AB are linear combinations of the columns of A (see a
comment following Example 2.3). Consequently, the number of linearly
independent columns of AB is less than or equal to the number of linearly
independent columns of A, and rank(AB) ≤ rank(A). Similarly, all the
rows of AB are linear combinations of the rows of B [see a comment following
(2.38)], and therefore rank(AB) ≤ rank(B).


(ii) This will be proved later. 
(iii) This will also be proved later. 


2.5 INVERSE 


A full-rank square matrix is said to be nonsingular. A nonsingular matrix A has a
unique inverse, denoted by $\mathbf{A}^{-1}$, with the property that


$$ \mathbf{A}\mathbf{A}^{-1} = \mathbf{A}^{-1}\mathbf{A} = \mathbf{I}. \tag{2.45} $$


If A is square and less than full rank, then it does not have an inverse and is said to be
singular. Note that full-rank rectangular matrices do not have inverses as in (2.45).
From the definition in (2.45), it is clear that A is the inverse of $\mathbf{A}^{-1}$:


$$ (\mathbf{A}^{-1})^{-1} = \mathbf{A}. \tag{2.46} $$


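Example 2.5. Let


$$ \mathbf{A} = \begin{pmatrix} 4 & 7 \\ 2 & 6 \end{pmatrix}. $$


Then


$$ \mathbf{A}^{-1} = \begin{pmatrix} .6 & -.7 \\ -.2 & .4 \end{pmatrix} $$


and


$$ \begin{pmatrix} 4 & 7 \\ 2 & 6 \end{pmatrix} \begin{pmatrix} .6 & -.7 \\ -.2 & .4 \end{pmatrix} = \begin{pmatrix} .6 & -.7 \\ -.2 & .4 \end{pmatrix} \begin{pmatrix} 4 & 7 \\ 2 & 6 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}. $$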


We can now prove Theorem 2.4(ii). 


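PROOF. If B is a full-rank square (nonsingular) matrix, there exists a matrix $\mathbf{B}^{-1}$ such
that $\mathbf{B}\mathbf{B}^{-1} = \mathbf{I}$. Then, by Theorem 2.4(i), we have


$$ \operatorname{rank}(\mathbf{A}) = \operatorname{rank}(\mathbf{A}\mathbf{B}\mathbf{B}^{-1}) \le \operatorname{rank}(\mathbf{AB}) \le \operatorname{rank}(\mathbf{A}). $$


Thus both inequalities become equalities, and rank(A) = rank(AB). Similarly,
rank(A) = rank(CA) for C nonsingular. □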


In applications, inverses are typically found by computer. Many calculators also 
compute inverses. Algorithms for hand calculation of inverses of small matrices 
can be found in texts on matrix algebra. 
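
As a rough illustration of what such a computer routine does, here is a Gauss-Jordan sketch using exact fractions. The function name `inverse` and the 2 × 2 test matrix are illustrative assumptions; production code would normally call a numerical library instead.

```python
from fractions import Fraction


def inverse(A):
    """Invert a square matrix by Gauss-Jordan elimination on (A, I)."""
    n = len(A)
    # Augment A with the identity matrix.
    M = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
         for i, row in enumerate(A)]
    for c in range(n):
        pivot = next((i for i in range(c, n) if M[i][c] != 0), None)
        if pivot is None:
            raise ValueError("matrix is singular")
        M[c], M[pivot] = M[pivot], M[c]
        pv = M[c][c]
        M[c] = [x / pv for x in M[c]]          # scale pivot row to 1
        for i in range(n):
            if i != c and M[i][c] != 0:        # clear the rest of column c
                f = M[i][c]
                M[i] = [a - f * b for a, b in zip(M[i], M[c])]
    return [row[n:] for row in M]


A = [[4, 7],
     [2, 6]]
A_inv = inverse(A)
print(A_inv)
```

A singular input such as `[[1, 2], [2, 4]]` raises `ValueError`, matching the statement that a square matrix of less than full rank has no inverse.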

If B is nonsingular and AB = CB, then we can multiply on the right by $\mathbf{B}^{-1}$
to obtain A = C. (If B is singular or rectangular, we can’t cancel it from
both sides of AB = CB; see Example 2.4b and the paragraph preceding the
example.) Similarly, if A is nonsingular, the system of equations Ax = c has the
unique solution


$$ \mathbf{x} = \mathbf{A}^{-1}\mathbf{c}, \tag{2.47} $$


since we can multiply on the left by $\mathbf{A}^{-1}$ to obtain

$$ \begin{aligned} \mathbf{A}^{-1}\mathbf{A}\mathbf{x} &= \mathbf{A}^{-1}\mathbf{c} \\ \mathbf{I}\mathbf{x} &= \mathbf{A}^{-1}\mathbf{c}. \end{aligned} $$


Two properties of inverses are given in the next two theorems. 


Theorem 2.5a. If A is nonsingular, then A′ is nonsingular and its inverse can be
found as


$$ (\mathbf{A}')^{-1} = (\mathbf{A}^{-1})'. \tag{2.48} $$


Theorem 2.5b. If A and B are nonsingular matrices of the same size, then AB is
nonsingular and


$$ (\mathbf{AB})^{-1} = \mathbf{B}^{-1}\mathbf{A}^{-1}. \tag{2.49} $$


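We now give the inverses of some special matrices. If A is symmetric and nonsingular
and is partitioned as

$$ \mathbf{A} = \begin{pmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{A}_{22} \end{pmatrix}, $$


and if $\mathbf{B} = \mathbf{A}_{22} - \mathbf{A}_{21}\mathbf{A}_{11}^{-1}\mathbf{A}_{12}$, then, provided $\mathbf{A}_{11}^{-1}$ and $\mathbf{B}^{-1}$ exist, the inverse of A is
given by


$$ \mathbf{A}^{-1} = \begin{pmatrix} \mathbf{A}_{11}^{-1} + \mathbf{A}_{11}^{-1}\mathbf{A}_{12}\mathbf{B}^{-1}\mathbf{A}_{21}\mathbf{A}_{11}^{-1} & -\mathbf{A}_{11}^{-1}\mathbf{A}_{12}\mathbf{B}^{-1} \\ -\mathbf{B}^{-1}\mathbf{A}_{21}\mathbf{A}_{11}^{-1} & \mathbf{B}^{-1} \end{pmatrix}. \tag{2.50} $$

As a special case of (2.50), consider the symmetric nonsingular matrix

$$ \mathbf{A} = \begin{pmatrix} \mathbf{A}_{11} & \mathbf{a}_{12} \\ \mathbf{a}_{12}' & a_{22} \end{pmatrix}, $$


in which $\mathbf{A}_{11}$ is square, $a_{22}$ is a 1 × 1 matrix, and $\mathbf{a}_{12}$ is a vector. Then if $\mathbf{A}_{11}^{-1}$ exists,
$\mathbf{A}^{-1}$ can be expressed as


$$ \mathbf{A}^{-1} = \frac{1}{b} \begin{pmatrix} b\mathbf{A}_{11}^{-1} + \mathbf{A}_{11}^{-1}\mathbf{a}_{12}\mathbf{a}_{12}'\mathbf{A}_{11}^{-1} & -\mathbf{A}_{11}^{-1}\mathbf{a}_{12} \\ -\mathbf{a}_{12}'\mathbf{A}_{11}^{-1} & 1 \end{pmatrix}, \tag{2.51} $$


where $b = a_{22} - \mathbf{a}_{12}'\mathbf{A}_{11}^{-1}\mathbf{a}_{12}$. As another special case of (2.50), we have


$$ \begin{pmatrix} \mathbf{A}_{11} & \mathbf{O} \\ \mathbf{O} & \mathbf{A}_{22} \end{pmatrix}^{-1} = \begin{pmatrix} \mathbf{A}_{11}^{-1} & \mathbf{O} \\ \mathbf{O} & \mathbf{A}_{22}^{-1} \end{pmatrix}. \tag{2.52} $$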


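If a square matrix of the form B + cc′ is nonsingular, where c is a vector and B is a
nonsingular matrix, then


$$ (\mathbf{B} + \mathbf{cc}')^{-1} = \mathbf{B}^{-1} - \frac{\mathbf{B}^{-1}\mathbf{cc}'\mathbf{B}^{-1}}{1 + \mathbf{c}'\mathbf{B}^{-1}\mathbf{c}}. \tag{2.53} $$

In more generality, if A, B, and A + PBQ are nonsingular, then

$$ (\mathbf{A} + \mathbf{PBQ})^{-1} = \mathbf{A}^{-1} - \mathbf{A}^{-1}\mathbf{PB}(\mathbf{B} + \mathbf{BQ}\mathbf{A}^{-1}\mathbf{PB})^{-1}\mathbf{BQ}\mathbf{A}^{-1}. \tag{2.54} $$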


Both (2.53) and (2.54) can be easily verified (Problems 2.33 and 2.34). 
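
A numerical check of (2.53) on a small case may help before attempting the algebraic verification. The sketch below assumes a hypothetical 2 × 2 diagonal B and vector c, and uses the closed-form 2 × 2 inverse in the illustrative helper `inv2`.

```python
from fractions import Fraction


def inv2(M):
    """Inverse of a nonsingular 2 x 2 matrix via the adjugate formula."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


B = [[Fraction(2), Fraction(0)],
     [Fraction(0), Fraction(4)]]
c = [Fraction(1), Fraction(2)]

# Left side of (2.53): invert B + cc' directly.
lhs = inv2([[B[i][j] + c[i] * c[j] for j in range(2)] for i in range(2)])

# Right side of (2.53): update B^{-1} using only matrix-vector products.
Binv = inv2(B)
Bc = [sum(Binv[i][k] * c[k] for k in range(2)) for i in range(2)]   # B^{-1} c
cB = [sum(c[k] * Binv[k][j] for k in range(2)) for j in range(2)]   # c' B^{-1}
denom = 1 + sum(c[i] * Bc[i] for i in range(2))                     # 1 + c' B^{-1} c
rhs = [[Binv[i][j] - Bc[i] * cB[j] / denom for j in range(2)] for i in range(2)]

print(lhs == rhs)
```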


2.6 POSITIVE DEFINITE MATRICES 


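Quadratic forms were introduced in (2.33). For example, the quadratic form
$3y_1^2 + y_2^2 + 2y_3^2 + 4y_1y_2 + 5y_1y_3 - 6y_2y_3$ can be expressed as


$$ 3y_1^2 + y_2^2 + 2y_3^2 + 4y_1y_2 + 5y_1y_3 - 6y_2y_3 = \mathbf{y}'\mathbf{A}\mathbf{y}, $$


where


$$ \mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix}, \qquad \mathbf{A} = \begin{pmatrix} 3 & 4 & 5 \\ 0 & 1 & -6 \\ 0 & 0 & 2 \end{pmatrix}. $$


However, the same quadratic form can also be expressed in terms of the symmetric
matrix


$$ \tfrac{1}{2}(\mathbf{A} + \mathbf{A}') = \begin{pmatrix} 3 & 2 & \tfrac{5}{2} \\ 2 & 1 & -3 \\ \tfrac{5}{2} & -3 & 2 \end{pmatrix}. $$


In general, any quadratic form y′Ay can be expressed as


$$ \mathbf{y}'\mathbf{A}\mathbf{y} = \mathbf{y}' \left( \frac{\mathbf{A} + \mathbf{A}'}{2} \right) \mathbf{y}, \tag{2.55} $$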


and thus the matrix of a quadratic form can always be chosen to be symmetric (and 
thereby unique). 
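
The identity (2.55) can be checked numerically with the triangular matrix A given above and its symmetrized version. In this sketch, `quad_form` is an illustrative helper and the test vector y = (1, 2, 3)′ is an arbitrary choice.

```python
from fractions import Fraction


def quad_form(A, y):
    """Evaluate y'Ay as the double sum of a_ij * y_i * y_j."""
    n = len(y)
    return sum(A[i][j] * y[i] * y[j] for i in range(n) for j in range(n))


A = [[3, 4, 5],
     [0, 1, -6],
     [0, 0, 2]]

# Symmetric matrix (A + A')/2 of the same quadratic form.
S = [[Fraction(A[i][j] + A[j][i], 2) for j in range(3)] for i in range(3)]

y = [1, 2, 3]
q_A = quad_form(A, y)
q_S = quad_form(S, y)
print(q_A, q_S)  # the two values agree, as (2.55) asserts
```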

The sums of squares we will encounter in regression (Chapters 6-11) and
analysis-of-variance (Chapters 12-15) can be expressed in the form y′Ay, where
y is an observation vector. Such quadratic forms remain positive (or at least nonnegative)
for all possible values of y. We now consider quadratic forms of this type.

If the symmetric matrix A has the property y′Ay > 0 for all possible y except
y = 0, then the quadratic form y′Ay is said to be positive definite, and A is said to
be a positive definite matrix. Similarly, if y′Ay ≥ 0 for all y and there is at least
one y ≠ 0 such that y′Ay = 0, then y′Ay and A are said to be positive semidefinite.
Both types of matrices are illustrated in the following example. 


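Example 2.6. To illustrate a positive definite matrix, consider


$$ \mathbf{A} = \begin{pmatrix} 2 & -1 \\ -1 & 3 \end{pmatrix} $$

and the associated quadratic form


$$ \mathbf{y}'\mathbf{A}\mathbf{y} = 2y_1^2 - 2y_1y_2 + 3y_2^2 = 2\left(y_1 - \tfrac{1}{2}y_2\right)^2 + \tfrac{5}{2}y_2^2, $$


which is clearly positive as long as $y_1$ and $y_2$ are not both zero.
To illustrate a positive semidefinite matrix, consider


$$ (2y_1 - y_2)^2 + (3y_1 - y_3)^2 + (3y_2 - 2y_3)^2, $$


which can be expressed as y′Ay, with


$$ \mathbf{A} = \begin{pmatrix} 13 & -2 & -3 \\ -2 & 10 & -6 \\ -3 & -6 & 5 \end{pmatrix}. $$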


If $2y_1 = y_2$, $3y_1 = y_3$, and $3y_2 = 2y_3$, then $(2y_1 - y_2)^2 + (3y_1 - y_3)^2 +
(3y_2 - 2y_3)^2 = 0$. Thus y′Ay = 0 for any multiple of $\mathbf{y} = (1, 2, 3)'$. Otherwise
y′Ay > 0 (except for y = 0).
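
The positive semidefinite case in Example 2.6 can be probed numerically. In this sketch, `quad_form` and `sum_of_squares` are illustrative helper names; the matrix entries are the coefficients obtained by expanding the three squared terms.

```python
def quad_form(A, y):
    """Evaluate y'Ay as the double sum of a_ij * y_i * y_j."""
    n = len(y)
    return sum(A[i][j] * y[i] * y[j] for i in range(n) for j in range(n))


def sum_of_squares(y):
    """The same quadratic form written as three squared terms."""
    y1, y2, y3 = y
    return (2 * y1 - y2) ** 2 + (3 * y1 - y3) ** 2 + (3 * y2 - 2 * y3) ** 2


A = [[13, -2, -3],
     [-2, 10, -6],
     [-3, -6, 5]]

q_null = quad_form(A, [1, 2, 3])    # zero at a nonzero y: not positive definite
q_other = quad_form(A, [1, 0, 0])   # positive at a y off that line
print(q_null, q_other)
```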



In the matrices in Example 2.6, the diagonal elements are positive. For positive 
definite matrices, this is true in general. 

Theorem 2.6a 


(i) If A is positive definite, then all its diagonal elements $a_{ii}$ are positive.
(ii) If A is positive semidefinite, then all $a_{ii} \ge 0$.


PROOF


(i) Let $\mathbf{y}' = (0, \ldots, 0, 1, 0, \ldots, 0)$ with a 1 in the ith position and 0’s elsewhere.
Then $\mathbf{y}'\mathbf{A}\mathbf{y} = a_{ii} > 0$.

(ii) Let $\mathbf{y}' = (0, \ldots, 0, 1, 0, \ldots, 0)$ with a 1 in the ith position and 0’s elsewhere.
Then $\mathbf{y}'\mathbf{A}\mathbf{y} = a_{ii} \ge 0$. □


Some additional properties of positive definite and positive semidefinite matrices 
are given in the following theorems. 


Theorem 2.6b. Let P be a nonsingular matrix. 


(i) If A is positive definite, then P’AP is positive definite. 
(ii) If A is positive semidefinite, then P’AP is positive semidefinite. 


PROOF 


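(i) To show that $\mathbf{y}'\mathbf{P}'\mathbf{A}\mathbf{P}\mathbf{y} > 0$ for $\mathbf{y} \ne \mathbf{0}$, note that $\mathbf{y}'(\mathbf{P}'\mathbf{A}\mathbf{P})\mathbf{y} = (\mathbf{P}\mathbf{y})'\mathbf{A}(\mathbf{P}\mathbf{y})$.
Since A is positive definite, $(\mathbf{P}\mathbf{y})'\mathbf{A}(\mathbf{P}\mathbf{y}) > 0$ provided that $\mathbf{P}\mathbf{y} \ne \mathbf{0}$. By
(2.47), Py = 0 only if y = 0, since $\mathbf{P}^{-1}\mathbf{P}\mathbf{y} = \mathbf{P}^{-1}\mathbf{0} = \mathbf{0}$. Thus
$\mathbf{y}'\mathbf{P}'\mathbf{A}\mathbf{P}\mathbf{y} > 0$ if $\mathbf{y} \ne \mathbf{0}$.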

(ii) See problem 2.36. 


Corollary 1. Let A be a p × p positive definite matrix and let B be a k × p matrix of
rank k ≤ p. Then BAB′ is positive definite.


Corollary 2. Let A be a p × p positive definite matrix and let B be a k × p matrix.
If k > p or if rank(B) = r, where r < k and r < p, then BAB′ is positive
semidefinite.


Theorem 2.6c. A symmetric matrix A is positive definite if and only if there exists a 
nonsingular matrix P such that A = P’P. 


PROOF. We prove the “if” part only. Suppose A = P′P for nonsingular P. Then

$$ \mathbf{y}'\mathbf{A}\mathbf{y} = \mathbf{y}'\mathbf{P}'\mathbf{P}\mathbf{y} = (\mathbf{P}\mathbf{y})'(\mathbf{P}\mathbf{y}). $$


This is a sum of squares [see (2.20)] and is positive unless Py = 0. By (2.47), Py = 0
only if y = 0. □


Corollary 1. A positive definite matrix is nonsingular. 


One method of factoring a positive definite matrix A into a product P’P as in 
Theorem 2.6c is provided by the Cholesky decomposition (Seber and Lee 2003, 
pp. 335-337), by which A can be factored uniquely into A = T’T, where T is a non- 
singular upper triangular matrix. 
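
For readers who want to see the factorization A = T′T concretely, the following is a minimal sketch of one standard way to compute an upper triangular Cholesky factor. It is not taken from the cited reference; the function name `cholesky_upper` and the 2 × 2 test matrix are illustrative assumptions.

```python
import math


def cholesky_upper(A):
    """Return upper triangular T with A = T'T for positive definite A."""
    n = len(A)
    T = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Diagonal entry: t_ii^2 = a_ii - sum_{k<i} t_ki^2.
        s = A[i][i] - sum(T[k][i] ** 2 for k in range(i))
        if s <= 0:
            raise ValueError("matrix is not positive definite")
        T[i][i] = math.sqrt(s)
        # Rest of row i: t_ij = (a_ij - sum_{k<i} t_ki t_kj) / t_ii.
        for j in range(i + 1, n):
            T[i][j] = (A[i][j] - sum(T[k][i] * T[k][j] for k in range(i))) / T[i][i]
    return T


A = [[4, 2],
     [2, 5]]
T = cholesky_upper(A)
print(T)
```

If the input is not positive definite, a nonpositive value appears under the square root and the sketch raises `ValueError`.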

For any square or rectangular matrix B, the matrix B’B is positive definite or posi- 
tive semidefinite. 


Theorem 2.6d. Let B be an n x p matrix. 


(i) If rank(B) = p, then B’B is positive definite. 
(ii) If rank(B) < p, then B’B is positive semidefinite. 


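PROOF

(i) To show that $\mathbf{y}'\mathbf{B}'\mathbf{B}\mathbf{y} > 0$ for $\mathbf{y} \ne \mathbf{0}$, we note that

$$ \mathbf{y}'\mathbf{B}'\mathbf{B}\mathbf{y} = (\mathbf{B}\mathbf{y})'(\mathbf{B}\mathbf{y}), $$

which is a sum of squares and is thereby positive unless By = 0. By (2.37),
we can express By in the form

$$ \mathbf{B}\mathbf{y} = y_1\mathbf{b}_1 + y_2\mathbf{b}_2 + \cdots + y_p\mathbf{b}_p. $$

This linear combination is not 0 (for any $\mathbf{y} \ne \mathbf{0}$) because rank(B) = p, and
the columns of B are therefore linearly independent [see (2.40)].

(ii) If rank(B) < p, then we can find $\mathbf{y} \ne \mathbf{0}$ such that

$$ \mathbf{B}\mathbf{y} = y_1\mathbf{b}_1 + y_2\mathbf{b}_2 + \cdots + y_p\mathbf{b}_p = \mathbf{0} $$

since the columns of B are linearly dependent [see (2.40)]. Hence
$\mathbf{y}'\mathbf{B}'\mathbf{B}\mathbf{y} \ge 0$. □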


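Note that if B is a square matrix, the matrix $\mathbf{BB} = \mathbf{B}^2$ is not necessarily positive
semidefinite. For example, let


$$ \mathbf{B} = \begin{pmatrix} 1 & -2 \\ 1 & -2 \end{pmatrix}. $$


Then


$$ \mathbf{B}^2 = \begin{pmatrix} -1 & 2 \\ -1 & 2 \end{pmatrix}, \qquad \mathbf{B}'\mathbf{B} = \begin{pmatrix} 2 & -4 \\ -4 & 8 \end{pmatrix}. $$


In this case, $\mathbf{B}^2$ is not positive semidefinite, but B′B is positive semidefinite, since
$\mathbf{y}'\mathbf{B}'\mathbf{B}\mathbf{y} = 2(y_1 - 2y_2)^2$.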

Two additional properties of positive definite matrices are given in the following 
theorems. 


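Theorem 2.6e. If A is positive definite, then $\mathbf{A}^{-1}$ is positive definite.

PROOF. By Theorem 2.6c, A = P′P, where P is nonsingular. By Theorems 2.5a and
2.5b, $\mathbf{A}^{-1} = (\mathbf{P}'\mathbf{P})^{-1} = \mathbf{P}^{-1}(\mathbf{P}')^{-1} = \mathbf{P}^{-1}(\mathbf{P}^{-1})'$, which is positive definite by
Theorem 2.6c. □


Theorem 2.6f. If A is positive definite and is partitioned in the form

$$ \mathbf{A} = \begin{pmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{A}_{22} \end{pmatrix}, $$

where $\mathbf{A}_{11}$ and $\mathbf{A}_{22}$ are square, then $\mathbf{A}_{11}$ and $\mathbf{A}_{22}$ are positive definite.


PROOF. We can write $\mathbf{A}_{11}$, for example, as $\mathbf{A}_{11} = (\mathbf{I}, \mathbf{O})\,\mathbf{A} \begin{pmatrix} \mathbf{I} \\ \mathbf{O} \end{pmatrix}$, where I is the same
size as $\mathbf{A}_{11}$. Then by Corollary 1 to Theorem 2.6b, $\mathbf{A}_{11}$ is positive definite. □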


2.7 SYSTEMS OF EQUATIONS 


The system of n (linear) equations in p unknowns 


$$ \begin{aligned} a_{11}x_1 + a_{12}x_2 + \cdots + a_{1p}x_p &= c_1 \\ a_{21}x_1 + a_{22}x_2 + \cdots + a_{2p}x_p &= c_2 \\ &\;\;\vdots \\ a_{n1}x_1 + a_{n2}x_2 + \cdots + a_{np}x_p &= c_n \end{aligned} \tag{2.56} $$


can be written in matrix form as

$$ \mathbf{A}\mathbf{x} = \mathbf{c}, \tag{2.57} $$


where A is n × p, x is p × 1, and c is n × 1. Note that if n ≠ p, x and c are of different
sizes. If n = p and A is nonsingular, then by (2.47), there exists a unique solution
vector x obtained as $\mathbf{x} = \mathbf{A}^{-1}\mathbf{c}$. If n > p, so that A has more rows than columns, then
Ax = c typically has no solution. If n < p, so that A has fewer rows than columns,
then Ax = c typically has an infinite number of solutions.

If the system of equations Ax = c has one or more solution vectors, it is said to be 
consistent. If the system has no solution, it is said to be inconsistent. 

To illustrate the structure of a consistent system of equations Ax = c, suppose that
A is p × p of rank r < p. Then the rows of A are linearly dependent, and there exists
some b ≠ 0 such that [see (2.38)]


$$ \mathbf{b}'\mathbf{A} = b_1\mathbf{a}_1' + b_2\mathbf{a}_2' + \cdots + b_p\mathbf{a}_p' = \mathbf{0}'. $$


Then we must also have $\mathbf{b}'\mathbf{c} = b_1c_1 + b_2c_2 + \cdots + b_pc_p = 0$, since multiplication of
Ax = c by b′ gives b′Ax = b′c, or 0′x = b′c. Otherwise, if b′c ≠ 0, there is no x
such that Ax = c. Hence, in order for Ax = c to be consistent, the same linear
relationships, if any, that exist among the rows of A must exist among the elements
(rows) of c. This is formalized by comparing the rank of A with the rank of the augmented
matrix (A, c). The notation (A, c) indicates that c has been appended to A as
an additional column.


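Theorem 2.7. The system of equations $\mathbf{A}\mathbf{x} = \mathbf{c}$ has at least one solution vector $\mathbf{x}$ if and only if rank$(\mathbf{A})$ = rank$(\mathbf{A}, \mathbf{c})$.

PROOF. Suppose that rank$(\mathbf{A})$ = rank$(\mathbf{A}, \mathbf{c})$, so that appending $\mathbf{c}$ does not change the rank. Then $\mathbf{c}$ is a linear combination of the columns of $\mathbf{A}$; that is, there exists some $\mathbf{x}$ such that
$$x_1\mathbf{a}_1 + x_2\mathbf{a}_2 + \cdots + x_p\mathbf{a}_p = \mathbf{c},$$
which, by (2.37), can be written as $\mathbf{A}\mathbf{x} = \mathbf{c}$. Thus $\mathbf{x}$ is a solution.

Conversely, suppose that there exists a solution vector $\mathbf{x}$ such that $\mathbf{A}\mathbf{x} = \mathbf{c}$. In general, rank$(\mathbf{A}) \le$ rank$(\mathbf{A}, \mathbf{c})$ (Harville 1997, p. 41). But since there exists an $\mathbf{x}$ such that $\mathbf{A}\mathbf{x} = \mathbf{c}$, we have
$$\operatorname{rank}(\mathbf{A}, \mathbf{c}) = \operatorname{rank}(\mathbf{A}, \mathbf{A}\mathbf{x}) = \operatorname{rank}[\mathbf{A}(\mathbf{I}, \mathbf{x})] \le \operatorname{rank}(\mathbf{A}) \quad [\text{by Theorem 2.4(i)}].$$
Hence
$$\operatorname{rank}(\mathbf{A}) \le \operatorname{rank}(\mathbf{A}, \mathbf{c}) \le \operatorname{rank}(\mathbf{A}),$$
and we have rank$(\mathbf{A})$ = rank$(\mathbf{A}, \mathbf{c})$.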

A consistent system of equations can be solved by the usual methods given in 
elementary algebra courses for eliminating variables, such as adding a multiple of 
one equation to another or solving for a variable and substituting into another 
equation. In the process, one or more variables may end up as arbitrary constants, 
thus generating an infinite number of solutions. A method of solution involving gen- 
eralized inverses is given in Section 2.8.2. Some illustrations of systems of equations 
and their solutions are given in the following examples. 


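Example 2.7a. Consider the system of equations
$$\begin{aligned} x_1 + 2x_2 &= 4 \\ x_1 - x_2 &= 1 \\ x_1 + x_2 &= 3 \end{aligned}$$
or
$$\begin{pmatrix} 1 & 2 \\ 1 & -1 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 4 \\ 1 \\ 3 \end{pmatrix}.$$
The augmented matrix is
$$(\mathbf{A}, \mathbf{c}) = \begin{pmatrix} 1 & 2 & 4 \\ 1 & -1 & 1 \\ 1 & 1 & 3 \end{pmatrix},$$
which has rank = 2 because the third column is equal to twice the first column plus the second:
$$2\begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} + \begin{pmatrix} 2 \\ -1 \\ 1 \end{pmatrix} = \begin{pmatrix} 4 \\ 1 \\ 3 \end{pmatrix}.$$
Since rank$(\mathbf{A})$ = rank$(\mathbf{A}, \mathbf{c})$ = 2, there is at least one solution. If we add twice the first equation to the second, the result is a multiple of the third equation. Thus the third equation is redundant, and the first two can readily be solved to obtain the unique solution $\mathbf{x} = (2, 1)'$.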




Figure 2.1 Three lines representing the three equations in Example 2.7a. 


The three lines representing the three equations are plotted in Figure 2.1. Notice 
that the three lines intersect at the point (2, 1), which is the unique solution of the 
three equations. 


Example 2.7b. If we change the 3 to 2 in the third equation in Example 2.7a, the augmented matrix becomes
$$(\mathbf{A}, \mathbf{c}) = \begin{pmatrix} 1 & 2 & 4 \\ 1 & -1 & 1 \\ 1 & 1 & 2 \end{pmatrix},$$
which has rank = 3, since no linear combination of columns is $\mathbf{0}$. [Alternatively, $|(\mathbf{A}, \mathbf{c})| \neq 0$, and $(\mathbf{A}, \mathbf{c})$ is nonsingular; see Theorem 2.9(iii)] Hence rank$(\mathbf{A}, \mathbf{c}) = 3 \neq$ rank$(\mathbf{A}) = 2$, and the system is inconsistent.
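The rank criterion of Theorem 2.7 can be checked mechanically for the two systems in Examples 2.7a and 2.7b. The following is a minimal sketch in Python using exact rational arithmetic; the helper names `rank` and `is_consistent` are illustrative choices, not notation from the text.

```python
from fractions import Fraction

def rank(rows):
    """Rank of a matrix (given as a list of rows) by Gauss-Jordan elimination."""
    m = [[Fraction(v) for v in row] for row in rows]
    r = 0  # number of pivots found so far
    for col in range(len(m[0])):
        # look for a nonzero pivot in this column at or below row r
        pivot = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(len(m)):
            if i != r and m[i][col] != 0:
                factor = m[i][col] / m[r][col]
                m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
        if r == len(m):
            break
    return r

def is_consistent(A, c):
    """Theorem 2.7: Ax = c is consistent iff rank(A) == rank(A, c)."""
    augmented = [row + [ci] for row, ci in zip(A, c)]
    return rank(A) == rank(augmented)

A = [[1, 2], [1, -1], [1, 1]]
print(is_consistent(A, [4, 1, 3]))  # system of Example 2.7a
print(is_consistent(A, [4, 1, 2]))  # system of Example 2.7b
```

Exact fractions are used so that the rank comparison is not affected by floating-point rounding.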

The three lines representing the three equations are plotted in Figure 2.2, in which we see that the three lines do not have a common point of intersection. [For the "best" approximate solution, one approach is to use least squares; that is, we find the values of $x_1$ and $x_2$ that minimize $(x_1 + 2x_2 - 4)^2 + (x_1 - x_2 - 1)^2 + (x_1 + x_2 - 2)^2$.]
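The least-squares values mentioned above can be computed directly. The sketch below assumes the standard normal-equations approach, $\mathbf{A}'\mathbf{A}\mathbf{x} = \mathbf{A}'\mathbf{c}$, which is not derived in this section; the function name `sum_of_squares` is an illustrative choice.

```python
from fractions import Fraction

A = [[1, 2], [1, -1], [1, 1]]
c = [4, 1, 2]

# Normal equations (assumed technique): (A'A) x = A'c
AtA = [[sum(Fraction(row[i] * row[j]) for row in A) for j in range(2)]
       for i in range(2)]
Atc = [sum(Fraction(row[i] * ci) for row, ci in zip(A, c)) for i in range(2)]

# Solve the 2 x 2 system by Cramer's rule
det = AtA[0][0] * AtA[1][1] - AtA[0][1] * AtA[1][0]
x1 = (Atc[0] * AtA[1][1] - AtA[0][1] * Atc[1]) / det
x2 = (AtA[0][0] * Atc[1] - AtA[1][0] * Atc[0]) / det

def sum_of_squares(v1, v2):
    """The criterion minimized in Example 2.7b."""
    return sum((row[0] * v1 + row[1] * v2 - ci) ** 2 for row, ci in zip(A, c))

print(x1, x2, sum_of_squares(x1, x2))
```

Because the system is inconsistent, the minimized sum of squares is positive rather than zero.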


Example 2.7c. Consider the system
$$\begin{aligned} x_1 + x_2 + x_3 &= 1 \\ 2x_1 + x_2 + 3x_3 &= 5 \\ 3x_1 + 2x_2 + 4x_3 &= 6. \end{aligned}$$




Figure 2.2 Three lines representing the three equations in Example 2.7b. 


The third equation is the sum of the first two, but the second is not a multiple of the first. Thus, rank$(\mathbf{A}, \mathbf{c})$ = rank$(\mathbf{A})$ = 2, and the system is consistent. By solving the first two equations for $x_1$ and $x_2$ in terms of $x_3$, we obtain
$$\begin{aligned} x_1 &= -2x_3 + 4 \\ x_2 &= x_3 - 3. \end{aligned}$$
The solution vector can be expressed as
$$\mathbf{x} = \begin{pmatrix} -2x_3 + 4 \\ x_3 - 3 \\ x_3 \end{pmatrix} = x_3 \begin{pmatrix} -2 \\ 1 \\ 1 \end{pmatrix} + \begin{pmatrix} 4 \\ -3 \\ 0 \end{pmatrix},$$
where $x_3$ is an arbitrary constant. Geometrically, $\mathbf{x}$ is the line representing the intersection of the two planes corresponding to the first two equations.


2.8 GENERALIZED INVERSE 


We now consider generalized inverses of those matrices that do not have inverses in the usual sense [see (2.45)]. A solution of a consistent system of equations $\mathbf{A}\mathbf{x} = \mathbf{c}$ can be expressed in terms of a generalized inverse of $\mathbf{A}$.




2.8.1 Definition and Properties 


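A generalized inverse of an $n \times p$ matrix $\mathbf{A}$ is any matrix $\mathbf{A}^-$ that satisfies
$$\mathbf{A}\mathbf{A}^-\mathbf{A} = \mathbf{A}. \tag{2.58}$$
A generalized inverse is not unique except when $\mathbf{A}$ is nonsingular, in which case $\mathbf{A}^- = \mathbf{A}^{-1}$. A generalized inverse is also called a conditional inverse.

Every matrix, whether square or rectangular, has a generalized inverse. This holds even for vectors. For example, let
$$\mathbf{x} = \begin{pmatrix} 1 \\ 2 \\ 3 \\ 4 \end{pmatrix}.$$
Then $\mathbf{x}_1^- = (1, 0, 0, 0)$ is a generalized inverse of $\mathbf{x}$ satisfying (2.58). Other examples are $\mathbf{x}_2^- = (0, \tfrac{1}{2}, 0, 0)$, $\mathbf{x}_3^- = (0, 0, \tfrac{1}{3}, 0)$, and $\mathbf{x}_4^- = (0, 0, 0, \tfrac{1}{4})$. For each $\mathbf{x}_i^-$, we have
$$\mathbf{x}\mathbf{x}_i^-\mathbf{x} = \mathbf{x}\,1 = \mathbf{x}, \quad i = 1, 2, 3, 4.$$
In this illustration, $\mathbf{x}$ is a column vector and $\mathbf{x}_i^-$ is a row vector. This pattern is generalized in the following theorem.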


Theorem 2.8a. If $\mathbf{A}$ is $n \times p$, any generalized inverse $\mathbf{A}^-$ is $p \times n$.


In the following example we give two illustrations of generalized inverses of a 
singular matrix. 


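Example 2.8.1. Let
$$\mathbf{A} = \begin{pmatrix} 2 & 2 & 3 \\ 1 & 0 & 1 \\ 3 & 2 & 4 \end{pmatrix}. \tag{2.59}$$
The third row of $\mathbf{A}$ is the sum of the first two rows, and the second row is not a multiple of the first; hence $\mathbf{A}$ has rank 2. Let
$$\mathbf{A}_1^- = \begin{pmatrix} 0 & 1 & 0 \\ \tfrac{1}{2} & -1 & 0 \\ 0 & 0 & 0 \end{pmatrix}, \qquad \mathbf{A}_2^- = \begin{pmatrix} 0 & 1 & 0 \\ 0 & -\tfrac{3}{2} & \tfrac{1}{2} \\ 0 & 0 & 0 \end{pmatrix}. \tag{2.60}$$
It is easily verified that $\mathbf{A}\mathbf{A}_1^-\mathbf{A} = \mathbf{A}$ and $\mathbf{A}\mathbf{A}_2^-\mathbf{A} = \mathbf{A}$.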




The methods used to obtain $\mathbf{A}_1^-$ and $\mathbf{A}_2^-$ in (2.60) are described in Theorem 2.8b and the five-step algorithm following the theorem.


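Theorem 2.8b. Suppose $\mathbf{A}$ is $n \times p$ of rank $r$ and that $\mathbf{A}$ is partitioned as
$$\mathbf{A} = \begin{pmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{A}_{22} \end{pmatrix},$$
where $\mathbf{A}_{11}$ is $r \times r$ of rank $r$. Then a generalized inverse of $\mathbf{A}$ is given by
$$\mathbf{A}^- = \begin{pmatrix} \mathbf{A}_{11}^{-1} & \mathbf{O} \\ \mathbf{O} & \mathbf{O} \end{pmatrix},$$
where the three $\mathbf{O}$ matrices are of appropriate sizes so that $\mathbf{A}^-$ is $p \times n$.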


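PROOF. By multiplication of partitioned matrices, as in (2.35), we obtain
$$\mathbf{A}\mathbf{A}^-\mathbf{A} = \begin{pmatrix} \mathbf{I} & \mathbf{O} \\ \mathbf{A}_{21}\mathbf{A}_{11}^{-1} & \mathbf{O} \end{pmatrix} \mathbf{A} = \begin{pmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{A}_{21}\mathbf{A}_{11}^{-1}\mathbf{A}_{12} \end{pmatrix}.$$
To show that $\mathbf{A}_{21}\mathbf{A}_{11}^{-1}\mathbf{A}_{12} = \mathbf{A}_{22}$, multiply $\mathbf{A}$ by
$$\mathbf{B} = \begin{pmatrix} \mathbf{I} & \mathbf{O} \\ -\mathbf{A}_{21}\mathbf{A}_{11}^{-1} & \mathbf{I} \end{pmatrix},$$
where $\mathbf{O}$ and $\mathbf{I}$ are of appropriate sizes, to obtain
$$\mathbf{B}\mathbf{A} = \begin{pmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{O} & \mathbf{A}_{22} - \mathbf{A}_{21}\mathbf{A}_{11}^{-1}\mathbf{A}_{12} \end{pmatrix}.$$
The matrix $\mathbf{B}$ is nonsingular, and the rank of $\mathbf{B}\mathbf{A}$ is therefore $r$ = rank$(\mathbf{A})$ [see Theorem 2.4(ii)]. In $\mathbf{B}\mathbf{A}$, the submatrix $\begin{pmatrix} \mathbf{A}_{11} \\ \mathbf{O} \end{pmatrix}$ is of rank $r$, and the columns headed by $\mathbf{A}_{12}$ are therefore linear combinations of the columns headed by $\mathbf{A}_{11}$. By a comment following Example 2.3, this relationship can be expressed as
$$\begin{pmatrix} \mathbf{A}_{12} \\ \mathbf{A}_{22} - \mathbf{A}_{21}\mathbf{A}_{11}^{-1}\mathbf{A}_{12} \end{pmatrix} = \begin{pmatrix} \mathbf{A}_{11} \\ \mathbf{O} \end{pmatrix} \mathbf{Q} \tag{2.61}$$
for some matrix $\mathbf{Q}$. By (2.27), the right side of (2.61) becomes
$$\begin{pmatrix} \mathbf{A}_{11} \\ \mathbf{O} \end{pmatrix} \mathbf{Q} = \begin{pmatrix} \mathbf{A}_{11}\mathbf{Q} \\ \mathbf{O}\mathbf{Q} \end{pmatrix} = \begin{pmatrix} \mathbf{A}_{11}\mathbf{Q} \\ \mathbf{O} \end{pmatrix}.$$
Thus $\mathbf{A}_{22} - \mathbf{A}_{21}\mathbf{A}_{11}^{-1}\mathbf{A}_{12} = \mathbf{O}$, or
$$\mathbf{A}_{22} = \mathbf{A}_{21}\mathbf{A}_{11}^{-1}\mathbf{A}_{12}.$$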


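Corollary 1. Suppose that $\mathbf{A}$ is $n \times p$ of rank $r$ and that $\mathbf{A}$ is partitioned as in Theorem 2.8b, where $\mathbf{A}_{22}$ is $r \times r$ of rank $r$. Then a generalized inverse of $\mathbf{A}$ is given by
$$\mathbf{A}^- = \begin{pmatrix} \mathbf{O} & \mathbf{O} \\ \mathbf{O} & \mathbf{A}_{22}^{-1} \end{pmatrix},$$
where the three $\mathbf{O}$ matrices are of appropriate sizes so that $\mathbf{A}^-$ is $p \times n$.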


The nonsingular submatrix need not be in the $\mathbf{A}_{11}$ or $\mathbf{A}_{22}$ position, as in Theorem 2.8b or its corollary. Theorem 2.8b can be extended to the following algorithm for finding a conditional inverse $\mathbf{A}^-$ for any $n \times p$ matrix $\mathbf{A}$ of rank $r$ (Searle 1982, p. 218):

1. Find any nonsingular $r \times r$ submatrix $\mathbf{C}$. It is not necessary that the elements of $\mathbf{C}$ occupy adjacent rows and columns in $\mathbf{A}$.
2. Find $\mathbf{C}^{-1}$ and $(\mathbf{C}^{-1})'$.
3. Replace the elements of $\mathbf{C}$ by the elements of $(\mathbf{C}^{-1})'$.
4. Replace all other elements in $\mathbf{A}$ by zeros.
5. Transpose the resulting matrix.
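The five steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the caller supplies the row and column indices of a nonsingular $r \times r$ submatrix, and the helper names (`inverse`, `generalized_inverse`, `matmul`) are hypothetical. It is applied to the matrix of Example 2.8.1.

```python
from fractions import Fraction

def inverse(M):
    """Invert a small nonsingular matrix by Gauss-Jordan elimination."""
    n = len(M)
    aug = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(M)]
    for col in range(n):
        pivot = next(i for i in range(col, n) if aug[i][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for i in range(n):
            if i != col:
                aug[i] = [a - aug[i][col] * b for a, b in zip(aug[i], aug[col])]
    return [row[n:] for row in aug]

def generalized_inverse(A, rows, cols):
    """Five-step algorithm; rows/cols select the nonsingular submatrix C."""
    C = [[A[i][j] for j in cols] for i in rows]        # step 1
    C_inv = inverse(C)                                 # step 2
    work = [[Fraction(0)] * len(A[0]) for _ in A]      # step 4: zeros elsewhere
    for a, i in enumerate(rows):
        for b, j in enumerate(cols):
            work[i][j] = C_inv[b][a]                   # step 3: element of (C^{-1})'
    return [list(col) for col in zip(*work)]           # step 5: transpose

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

A = [[2, 2, 3], [1, 0, 1], [3, 2, 4]]
G1 = generalized_inverse(A, rows=[0, 1], cols=[0, 1])
G2 = generalized_inverse(A, rows=[1, 2], cols=[0, 1])
print(G1)
print(G2)
print(matmul(matmul(A, G1), A) == A, matmul(matmul(A, G2), A) == A)
```

Choosing a different nonsingular submatrix gives a different generalized inverse, which illustrates the nonuniqueness noted in Section 2.8.1.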


Some properties of generalized inverses are given in the following theorem, which 
is the theoretical basis for many of the results in Chapter 11. 


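Theorem 2.8c. Let $\mathbf{A}$ be $n \times p$ of rank $r$, let $\mathbf{A}^-$ be any generalized inverse of $\mathbf{A}$, and let $(\mathbf{A}'\mathbf{A})^-$ be any generalized inverse of $\mathbf{A}'\mathbf{A}$. Then

(i) rank$(\mathbf{A}^-\mathbf{A})$ = rank$(\mathbf{A}\mathbf{A}^-)$ = rank$(\mathbf{A})$ = $r$.
(ii) $(\mathbf{A}^-)'$ is a generalized inverse of $\mathbf{A}'$; that is, $(\mathbf{A}')^- = (\mathbf{A}^-)'$.
(iii) $\mathbf{A} = \mathbf{A}(\mathbf{A}'\mathbf{A})^-\mathbf{A}'\mathbf{A}$ and $\mathbf{A}' = \mathbf{A}'\mathbf{A}(\mathbf{A}'\mathbf{A})^-\mathbf{A}'$.
(iv) $(\mathbf{A}'\mathbf{A})^-\mathbf{A}'$ is a generalized inverse of $\mathbf{A}$; that is, $\mathbf{A}^- = (\mathbf{A}'\mathbf{A})^-\mathbf{A}'$.
(v) $\mathbf{A}(\mathbf{A}'\mathbf{A})^-\mathbf{A}'$ is symmetric, has rank = $r$, and is invariant to the choice of $(\mathbf{A}'\mathbf{A})^-$; that is, $\mathbf{A}(\mathbf{A}'\mathbf{A})^-\mathbf{A}'$ remains the same, no matter what value of $(\mathbf{A}'\mathbf{A})^-$ is used.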


A generalized inverse of a symmetric matrix is not necessarily symmetric. 
However, it is also true that a symmetric generalized inverse can always be found 
for a symmetric matrix; see Problem 2.46. In this book, we will assume that gener- 
alized inverses of symmetric matrices are symmetric. 


2.8.2 Generalized Inverses and Systems of Equations 


Generalized inverses can be used to find solutions to a system of equations. 


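Theorem 2.8d. If the system of equations $\mathbf{A}\mathbf{x} = \mathbf{c}$ is consistent and if $\mathbf{A}^-$ is any generalized inverse for $\mathbf{A}$, then $\mathbf{x} = \mathbf{A}^-\mathbf{c}$ is a solution.

PROOF. Since $\mathbf{A}\mathbf{A}^-\mathbf{A} = \mathbf{A}$, we have
$$\mathbf{A}\mathbf{A}^-\mathbf{A}\mathbf{x} = \mathbf{A}\mathbf{x}.$$
Substituting $\mathbf{A}\mathbf{x} = \mathbf{c}$ on both sides, we obtain
$$\mathbf{A}\mathbf{A}^-\mathbf{c} = \mathbf{c}.$$
Writing this in the form $\mathbf{A}(\mathbf{A}^-\mathbf{c}) = \mathbf{c}$, we see that $\mathbf{A}^-\mathbf{c}$ is a solution to $\mathbf{A}\mathbf{x} = \mathbf{c}$.

Different choices of $\mathbf{A}^-$ will result in different solutions for $\mathbf{A}\mathbf{x} = \mathbf{c}$.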


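Theorem 2.8e. If the system of equations $\mathbf{A}\mathbf{x} = \mathbf{c}$ is consistent, then all possible solutions can be obtained in the following two ways:

(i) Use a specific $\mathbf{A}^-$ in $\mathbf{x} = \mathbf{A}^-\mathbf{c} + (\mathbf{I} - \mathbf{A}^-\mathbf{A})\mathbf{h}$, and use all possible values of the arbitrary vector $\mathbf{h}$.
(ii) Use all possible values of $\mathbf{A}^-$ in $\mathbf{x} = \mathbf{A}^-\mathbf{c}$ if $\mathbf{c} \neq \mathbf{0}$.

PROOF. See Searle (1982, p. 238).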
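Part (i) of Theorem 2.8e can be illustrated with the system of Example 2.7c. The sketch below takes the first equation of that example as $x_1 + x_2 + x_3 = 1$ (the form consistent with the solution stated there) and builds $\mathbf{A}^-$ from the upper-left $2 \times 2$ block as in Theorem 2.8b; the function name `solution` is a hypothetical helper.

```python
from fractions import Fraction as F

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

A = [[1, 1, 1], [2, 1, 3], [3, 2, 4]]
c = [[1], [5], [6]]

# Generalized inverse from the upper-left 2 x 2 block (Theorem 2.8b):
# the inverse of [[1, 1], [2, 1]] is [[-1, 1], [2, -1]]
A_g = [[F(-1), F(1), F(0)], [F(2), F(-1), F(0)], [F(0), F(0), F(0)]]

A_gA = matmul(A_g, A)
I_minus_AgA = [[F(int(i == j)) - A_gA[i][j] for j in range(3)] for i in range(3)]

def solution(h):
    """x = A^- c + (I - A^- A) h for an arbitrary 3 x 1 vector h."""
    particular = matmul(A_g, c)
    homogeneous = matmul(I_minus_AgA, h)
    return [[p[0] + q[0]] for p, q in zip(particular, homogeneous)]

for h in ([[0], [0], [0]], [[7], [-2], [1]], [[0], [5], [3]]):
    x = solution(h)
    print(x, matmul(A, x) == c)
```

With this choice of $\mathbf{A}^-$, only the third element of $\mathbf{h}$ affects the result, and the solutions trace out the same line $x_3(-2, 1, 1)' + (4, -3, 0)'$ found in Example 2.7c.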


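A necessary and sufficient condition for the system of equations $\mathbf{A}\mathbf{x} = \mathbf{c}$ to be consistent can be given in terms of a generalized inverse of $\mathbf{A}$ (Graybill 1976, p. 36).

Theorem 2.8f. The system of equations $\mathbf{A}\mathbf{x} = \mathbf{c}$ has a solution if and only if for any generalized inverse $\mathbf{A}^-$ of $\mathbf{A}$
$$\mathbf{A}\mathbf{A}^-\mathbf{c} = \mathbf{c}.$$

PROOF. Suppose that $\mathbf{A}\mathbf{x} = \mathbf{c}$ is consistent. Then, by Theorem 2.8d, $\mathbf{x} = \mathbf{A}^-\mathbf{c}$ is a solution. Multiply $\mathbf{c} = \mathbf{A}\mathbf{x}$ by $\mathbf{A}\mathbf{A}^-$ to obtain
$$\mathbf{A}\mathbf{A}^-\mathbf{c} = \mathbf{A}\mathbf{A}^-\mathbf{A}\mathbf{x} = \mathbf{A}\mathbf{x} = \mathbf{c}.$$
Conversely, suppose $\mathbf{A}\mathbf{A}^-\mathbf{c} = \mathbf{c}$. Multiply $\mathbf{x} = \mathbf{A}^-\mathbf{c}$ by $\mathbf{A}$ to obtain
$$\mathbf{A}\mathbf{x} = \mathbf{A}\mathbf{A}^-\mathbf{c} = \mathbf{c}.$$
Hence, a solution exists, namely, $\mathbf{x} = \mathbf{A}^-\mathbf{c}$.

Theorem 2.8f provides an alternative to Theorem 2.7 for determining whether a system of equations is consistent.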


2.9 DETERMINANTS 


The determinant of an $n \times n$ matrix $\mathbf{A}$ is a scalar function of $\mathbf{A}$ defined as the sum of all $n!$ possible products of $n$ elements such that


1. each product contains one element from every row and every column of A. 

2. the factors in each product are written so that the column subscripts appear 
in order of magnitude and each product is then preceded by a plus or 
minus sign according to whether the number of inversions in the row sub- 
scripts is even or odd. (An inversion occurs whenever a larger number pre- 
cedes a smaller one.) 


The determinant of A is denoted by |A| or det(A). The preceding definition is not 
very useful in evaluating determinants, except in the case of 2 x 2 or 3 x 3 matrices. 
For larger matrices, determinants are typically found by computer. Some calculators 
also evaluate determinants. 

The determinants of some special square matrices are given in the following 
theorem. 


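Theorem 2.9a.

(i) If $\mathbf{D} = \operatorname{diag}(d_1, d_2, \ldots, d_n)$, $|\mathbf{D}| = \prod_{i=1}^{n} d_i$.
(ii) The determinant of a triangular matrix is the product of the diagonal elements.
(iii) If $\mathbf{A}$ is singular, $|\mathbf{A}| = 0$.
(iv) If $\mathbf{A}$ is nonsingular, $|\mathbf{A}| \neq 0$.
(v) If $\mathbf{A}$ is positive definite, $|\mathbf{A}| > 0$.
(vi) $|\mathbf{A}'| = |\mathbf{A}|$.
(vii) If $\mathbf{A}$ is nonsingular, $|\mathbf{A}^{-1}| = \dfrac{1}{|\mathbf{A}|}$.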


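Example 2.9a. We illustrate each of the properties in Theorem 2.9a.

(i) diagonal: $\begin{vmatrix} 2 & 0 \\ 0 & 3 \end{vmatrix} = (2)(3) - (0)(0) = (2)(3)$.

(ii) triangular: $\begin{vmatrix} 2 & 1 \\ 0 & 3 \end{vmatrix} = (2)(3) - (0)(1) = (2)(3)$.

(iii) singular: $\begin{vmatrix} 1 & 2 \\ 3 & 6 \end{vmatrix} = (1)(6) - (3)(2) = 0$,

nonsingular: $\begin{vmatrix} 1 & 2 \\ 3 & 4 \end{vmatrix} = (1)(4) - (3)(2) = -2$.

(iv) positive definite: $\begin{vmatrix} 3 & -2 \\ -2 & 4 \end{vmatrix} = (3)(4) - (-2)(-2) = 8 > 0$.

(v) transpose: $\begin{vmatrix} 3 & -7 \\ 2 & 1 \end{vmatrix} = (3)(1) - (2)(-7) = 17$, $\quad \begin{vmatrix} 3 & 2 \\ -7 & 1 \end{vmatrix} = (3)(1) - (-7)(2) = 17$.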


(vi) inverse: 


eee ae ee SD sas 4 -2|_, 
I, ees ee” Sas la ae Neat = Sy 


As a special case of (2.62), suppose that all diagonal elements are equal, say, $\mathbf{D} = \operatorname{diag}(c, c, \ldots, c) = c\mathbf{I}$. Then
$$|\mathbf{D}| = |c\mathbf{I}| = \prod_{i=1}^{n} c = c^n. \tag{2.68}$$
By extension, if an $n \times n$ matrix $\mathbf{A}$ is multiplied by a scalar $c$, the determinant becomes
$$|c\mathbf{A}| = c^n|\mathbf{A}|. \tag{2.69}$$


The determinant of certain partitioned matrices is given in the following theorem. 


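Theorem 2.9b. If the square matrix $\mathbf{A}$ is partitioned as
$$\mathbf{A} = \begin{pmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{A}_{22} \end{pmatrix}, \tag{2.70}$$
and if $\mathbf{A}_{11}$ and $\mathbf{A}_{22}$ are square and nonsingular (but not necessarily the same size), then
$$|\mathbf{A}| = |\mathbf{A}_{11}|\,|\mathbf{A}_{22} - \mathbf{A}_{21}\mathbf{A}_{11}^{-1}\mathbf{A}_{12}| \tag{2.71}$$
$$\phantom{|\mathbf{A}|} = |\mathbf{A}_{22}|\,|\mathbf{A}_{11} - \mathbf{A}_{12}\mathbf{A}_{22}^{-1}\mathbf{A}_{21}|. \tag{2.72}$$
Note the analogy of (2.71) and (2.72) to the case of the determinant of a $2 \times 2$ matrix:
$$\begin{vmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{vmatrix} = a_{11}a_{22} - a_{21}a_{12} = a_{11}\left(a_{22} - \frac{a_{21}a_{12}}{a_{11}}\right) = a_{22}\left(a_{11} - \frac{a_{12}a_{21}}{a_{22}}\right).$$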


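Corollary 1. Suppose
$$\mathbf{A} = \begin{pmatrix} \mathbf{A}_{11} & \mathbf{O} \\ \mathbf{A}_{21} & \mathbf{A}_{22} \end{pmatrix} \quad \text{or} \quad \mathbf{A} = \begin{pmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{O} & \mathbf{A}_{22} \end{pmatrix},$$
where $\mathbf{A}_{11}$ and $\mathbf{A}_{22}$ are square (but not necessarily the same size). Then in either case
$$|\mathbf{A}| = |\mathbf{A}_{11}|\,|\mathbf{A}_{22}|. \tag{2.73}$$

Corollary 2. Let
$$\mathbf{A} = \begin{pmatrix} \mathbf{A}_{11} & \mathbf{O} \\ \mathbf{O} & \mathbf{A}_{22} \end{pmatrix},$$
where $\mathbf{A}_{11}$ and $\mathbf{A}_{22}$ are square (but not necessarily the same size). Then
$$|\mathbf{A}| = |\mathbf{A}_{11}|\,|\mathbf{A}_{22}|. \tag{2.74}$$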


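Corollary 3. If $\mathbf{A}$ has the form $\mathbf{A} = \begin{pmatrix} \mathbf{A}_{11} & \mathbf{a}_{12} \\ \mathbf{a}_{12}' & a_{22} \end{pmatrix}$, where $\mathbf{A}_{11}$ is a nonsingular matrix, $\mathbf{a}_{12}$ is a vector, and $a_{22}$ is a $1 \times 1$ matrix, then
$$|\mathbf{A}| = \begin{vmatrix} \mathbf{A}_{11} & \mathbf{a}_{12} \\ \mathbf{a}_{12}' & a_{22} \end{vmatrix} = |\mathbf{A}_{11}|\,(a_{22} - \mathbf{a}_{12}'\mathbf{A}_{11}^{-1}\mathbf{a}_{12}). \tag{2.75}$$

Corollary 4. If $\mathbf{A}$ has the form $\mathbf{A} = \begin{pmatrix} \mathbf{B} & \mathbf{c} \\ -\mathbf{c}' & 1 \end{pmatrix}$, where $\mathbf{c}$ is a vector and $\mathbf{B}$ is a nonsingular matrix, then
$$|\mathbf{B} + \mathbf{c}\mathbf{c}'| = |\mathbf{B}|\,(1 + \mathbf{c}'\mathbf{B}^{-1}\mathbf{c}). \tag{2.76}$$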


The determinant of the product of two square matrices is given in the following 
theorem. 


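Theorem 2.9c. If $\mathbf{A}$ and $\mathbf{B}$ are square and the same size, then the determinant of the product is the product of the determinants:
$$|\mathbf{A}\mathbf{B}| = |\mathbf{A}|\,|\mathbf{B}|. \tag{2.77}$$

Corollary 1.
$$|\mathbf{A}\mathbf{B}| = |\mathbf{B}\mathbf{A}|. \tag{2.78}$$

Corollary 2.
$$|\mathbf{A}^2| = |\mathbf{A}|^2. \tag{2.79}$$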
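Several of the determinant results above, namely (2.69), (2.77), (2.78), and part (vi) of Theorem 2.9a, can be verified numerically. The sketch below evaluates determinants by cofactor expansion along the first row, a standard technique equivalent to the definition in this section but not stated in the text; the matrices `A` and `B` are arbitrary illustrations, not taken from the book.

```python
def det(M):
    """Determinant by cofactor expansion along the first row (small matrices only)."""
    if len(M) == 1:
        return M[0][0]
    total = 0
    for j, entry in enumerate(M[0]):
        minor = [row[:j] + row[j + 1:] for row in M[1:]]
        total += (-1) ** j * entry * det(minor)
    return total

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

A = [[2, 1, 0], [1, 3, 1], [0, 1, 4]]   # illustrative 3 x 3 matrices
B = [[1, 2, 0], [0, 1, 5], [3, 0, 1]]
c, n = 3, 3

cA = [[c * v for v in row] for row in A]
A_transposed = [list(col) for col in zip(*A)]

print(det(A), det(B))
print(det(matmul(A, B)) == det(A) * det(B))          # (2.77)
print(det(matmul(A, B)) == det(matmul(B, A)))        # (2.78)
print(det(cA) == c ** n * det(A))                    # (2.69)
print(det(A_transposed) == det(A))                   # Theorem 2.9a(vi)
```

Cofactor expansion grows factorially with matrix size, so it is used here only because the matrices are small.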




Example 2.9b. To illustrate Theorem 2.9c, let 


1 2 3) 2 
a=(5 . and B=(j a 
Then 


a) 
AB = , |AB|=—16, 
(3.3 


$|\mathbf{A}| = -2$, $\quad |\mathbf{B}| = 8$, $\quad |\mathbf{A}|\,|\mathbf{B}| = -16$.


2.10 ORTHOGONAL VECTORS AND MATRICES 


Two $n \times 1$ vectors $\mathbf{a}$ and $\mathbf{b}$ are said to be orthogonal if
$$\mathbf{a}'\mathbf{b} = a_1b_1 + a_2b_2 + \cdots + a_nb_n = 0. \tag{2.80}$$
Note that the term orthogonal applies to two vectors, not to a single vector. Geometrically, two orthogonal vectors are perpendicular to each other. This is illustrated in Figure 2.3 for the vectors $\mathbf{x}_1 = (4, 2)'$ and $\mathbf{x}_2 = (-1, 2)'$. Note that $\mathbf{x}_1'\mathbf{x}_2 = (4)(-1) + (2)(2) = 0$.
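To show that two orthogonal vectors are perpendicular, let $\theta$ be the angle between vectors $\mathbf{a}$ and $\mathbf{b}$ in Figure 2.4. The vector from the terminal point of $\mathbf{a}$ to the terminal point of $\mathbf{b}$ can be represented as $\mathbf{c} = \mathbf{b} - \mathbf{a}$. The law of cosines for the relationship of $\theta$ to the sides of the triangle can be stated in vector form as
$$\begin{aligned}
\cos\theta &= \frac{\mathbf{a}'\mathbf{a} + \mathbf{b}'\mathbf{b} - (\mathbf{b} - \mathbf{a})'(\mathbf{b} - \mathbf{a})}{2\sqrt{(\mathbf{a}'\mathbf{a})(\mathbf{b}'\mathbf{b})}} \\
&= \frac{\mathbf{a}'\mathbf{a} + \mathbf{b}'\mathbf{b} - (\mathbf{b}'\mathbf{b} + \mathbf{a}'\mathbf{a} - 2\mathbf{a}'\mathbf{b})}{2\sqrt{(\mathbf{a}'\mathbf{a})(\mathbf{b}'\mathbf{b})}} \\
&= \frac{\mathbf{a}'\mathbf{b}}{\sqrt{(\mathbf{a}'\mathbf{a})(\mathbf{b}'\mathbf{b})}}.
\end{aligned} \tag{2.81}$$

Figure 2.3 Two orthogonal (perpendicular) vectors, $(-1, 2)$ and $(4, 2)$.

Figure 2.4 Vectors a and b in 3-space.

When $\theta = 90°$, $\mathbf{a}'\mathbf{b} = 0$ since $\cos(90°) = 0$. Thus $\mathbf{a}$ and $\mathbf{b}$ are perpendicular when $\mathbf{a}'\mathbf{b} = 0$.

If $\mathbf{a}'\mathbf{a} = 1$, the vector $\mathbf{a}$ is said to be normalized. A vector $\mathbf{b}$ can be normalized by dividing by its length, $\sqrt{\mathbf{b}'\mathbf{b}}$. Thus
$$\mathbf{c} = \frac{\mathbf{b}}{\sqrt{\mathbf{b}'\mathbf{b}}} \tag{2.82}$$
is normalized so that $\mathbf{c}'\mathbf{c} = 1$.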

A set of $p \times 1$ vectors $\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_p$ that are normalized ($\mathbf{c}_i'\mathbf{c}_i = 1$ for all $i$) and mutually orthogonal ($\mathbf{c}_i'\mathbf{c}_j = 0$ for all $i \neq j$) is said to be an orthonormal set of vectors. If the $p \times p$ matrix $\mathbf{C} = (\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_p)$ has orthonormal columns, $\mathbf{C}$ is called an orthogonal matrix. Since the elements of $\mathbf{C}'\mathbf{C}$ are products of columns of $\mathbf{C}$ [see Theorem 2.2c(i)], an orthogonal matrix $\mathbf{C}$ has the property
$$\mathbf{C}'\mathbf{C} = \mathbf{I}. \tag{2.83}$$
It can be shown that an orthogonal matrix $\mathbf{C}$ also satisfies
$$\mathbf{C}\mathbf{C}' = \mathbf{I}. \tag{2.84}$$
Thus an orthogonal matrix $\mathbf{C}$ has orthonormal rows as well as orthonormal columns. It is also clear from (2.83) and (2.84) that $\mathbf{C}' = \mathbf{C}^{-1}$ if $\mathbf{C}$ is orthogonal.


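Example 2.10. To illustrate an orthogonal matrix, we start with
$$\mathbf{A} = \begin{pmatrix} 1 & 1 & 1 \\ 1 & -2 & 0 \\ 1 & 1 & -1 \end{pmatrix},$$
whose columns are mutually orthogonal but not orthonormal. To normalize the three columns, we divide by their respective lengths, $\sqrt{3}$, $\sqrt{6}$, and $\sqrt{2}$, to obtain the matrix
$$\mathbf{C} = \begin{pmatrix} 1/\sqrt{3} & 1/\sqrt{6} & 1/\sqrt{2} \\ 1/\sqrt{3} & -2/\sqrt{6} & 0 \\ 1/\sqrt{3} & 1/\sqrt{6} & -1/\sqrt{2} \end{pmatrix},$$
whose columns are orthonormal. Note that the rows of $\mathbf{C}$ are also orthonormal, so that $\mathbf{C}$ satisfies (2.84) as well as (2.83).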
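The normalization in Example 2.10 and the properties (2.83) and (2.84) can be checked numerically. The following is a small floating-point sketch; the helper names, the tolerance, and the test point `x` are illustrative assumptions. It also confirms that multiplying by $\mathbf{C}$ leaves the squared length $\mathbf{x}'\mathbf{x}$ unchanged.

```python
import math

A = [[1, 1, 1], [1, -2, 0], [1, 1, -1]]

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

# Column lengths sqrt(3), sqrt(6), sqrt(2), then normalize each column
lengths = [math.sqrt(sum(v * v for v in col)) for col in transpose(A)]
C = [[A[i][j] / lengths[j] for j in range(3)] for i in range(3)]

def is_identity(M, tol=1e-12):
    return all(abs(M[i][j] - (1.0 if i == j else 0.0)) < tol
               for i in range(len(M)) for j in range(len(M)))

def squared_length(v):
    return sum(row[0] ** 2 for row in v)

x = [[3.0], [-1.0], [2.0]]   # arbitrary illustrative point
z = matmul(C, x)

print(is_identity(matmul(transpose(C), C)))   # C'C = I
print(is_identity(matmul(C, transpose(C))))   # CC' = I
print(squared_length(x), squared_length(z))   # equal up to rounding
```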


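Multiplication of a vector by an orthogonal matrix has the effect of rotating axes; that is, if a point $\mathbf{x}$ is transformed to $\mathbf{z} = \mathbf{C}\mathbf{x}$, where $\mathbf{C}$ is orthogonal, then the distance from the origin to $\mathbf{z}$ is the same as the distance to $\mathbf{x}$:
$$\mathbf{z}'\mathbf{z} = (\mathbf{C}\mathbf{x})'(\mathbf{C}\mathbf{x}) = \mathbf{x}'\mathbf{C}'\mathbf{C}\mathbf{x} = \mathbf{x}'\mathbf{I}\mathbf{x} = \mathbf{x}'\mathbf{x}. \tag{2.85}$$
Hence, the transformation from $\mathbf{x}$ to $\mathbf{z}$ is a rotation.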
Some properties of orthogonal matrices are given in the following theorem. 


Theorem 2.10. If the $p \times p$ matrix $\mathbf{C}$ is orthogonal and if $\mathbf{A}$ is any $p \times p$ matrix, then

(i) $|\mathbf{C}| = +1$ or $-1$.
(ii) $|\mathbf{C}'\mathbf{A}\mathbf{C}| = |\mathbf{A}|$.
(iii) $-1 \le c_{ij} \le 1$, where $c_{ij}$ is any element of $\mathbf{C}$.


2.11 TRACE 


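The trace of an $n \times n$ matrix $\mathbf{A} = (a_{ij})$ is a scalar function defined as the sum of the diagonal elements of $\mathbf{A}$; that is, $\operatorname{tr}(\mathbf{A}) = \sum_{i=1}^{n} a_{ii}$. For example, suppose
$$\mathbf{A} = \begin{pmatrix} 8 & 4 & 2 \\ 2 & -3 & 6 \\ 3 & 5 & 9 \end{pmatrix}.$$
Then
$$\operatorname{tr}(\mathbf{A}) = 8 - 3 + 9 = 14.$$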


Some properties of the trace are given in the following theorem. 


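Theorem 2.11

(i) If $\mathbf{A}$ and $\mathbf{B}$ are $n \times n$, then
$$\operatorname{tr}(\mathbf{A} + \mathbf{B}) = \operatorname{tr}(\mathbf{A}) + \operatorname{tr}(\mathbf{B}). \tag{2.86}$$
(ii) If $\mathbf{A}$ is $n \times p$ and $\mathbf{B}$ is $p \times n$, then
$$\operatorname{tr}(\mathbf{A}\mathbf{B}) = \operatorname{tr}(\mathbf{B}\mathbf{A}). \tag{2.87}$$
Note that in (2.87) $n$ can be less than, equal to, or greater than $p$.
(iii) If $\mathbf{A}$ is $n \times p$, then
$$\operatorname{tr}(\mathbf{A}'\mathbf{A}) = \sum_{i=1}^{p} \mathbf{a}_i'\mathbf{a}_i, \tag{2.88}$$
where $\mathbf{a}_i$ is the $i$th column of $\mathbf{A}$.
(iv) If $\mathbf{A}$ is $n \times p$, then
$$\operatorname{tr}(\mathbf{A}\mathbf{A}') = \sum_{i=1}^{n} \mathbf{a}_i'\mathbf{a}_i, \tag{2.89}$$
where $\mathbf{a}_i'$ is the $i$th row of $\mathbf{A}$.
(v) If $\mathbf{A} = (a_{ij})$ is an $n \times p$ matrix with representative element $a_{ij}$, then
$$\operatorname{tr}(\mathbf{A}'\mathbf{A}) = \operatorname{tr}(\mathbf{A}\mathbf{A}') = \sum_{i=1}^{n}\sum_{j=1}^{p} a_{ij}^2. \tag{2.90}$$
(vi) If $\mathbf{A}$ is any $n \times n$ matrix and $\mathbf{P}$ is any $n \times n$ nonsingular matrix, then
$$\operatorname{tr}(\mathbf{P}^{-1}\mathbf{A}\mathbf{P}) = \operatorname{tr}(\mathbf{A}). \tag{2.91}$$
(vii) If $\mathbf{A}$ is any $n \times n$ matrix and $\mathbf{C}$ is any $n \times n$ orthogonal matrix, then
$$\operatorname{tr}(\mathbf{C}'\mathbf{A}\mathbf{C}) = \operatorname{tr}(\mathbf{A}). \tag{2.92}$$
(viii) If $\mathbf{A}$ is $n \times p$ of rank $r$ and $\mathbf{A}^-$ is a generalized inverse of $\mathbf{A}$, then
$$\operatorname{tr}(\mathbf{A}^-\mathbf{A}) = \operatorname{tr}(\mathbf{A}\mathbf{A}^-) = r. \tag{2.93}$$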
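PROOF. We prove parts (ii), (iii), and (vi).

(ii) By (2.13), the $i$th diagonal element of $\mathbf{E} = \mathbf{A}\mathbf{B}$ is $e_{ii} = \sum_k a_{ik}b_{ki}$. Then
$$\operatorname{tr}(\mathbf{A}\mathbf{B}) = \operatorname{tr}(\mathbf{E}) = \sum_i e_{ii} = \sum_i \sum_k a_{ik}b_{ki}.$$
Similarly, the $i$th diagonal element of $\mathbf{F} = \mathbf{B}\mathbf{A}$ is $f_{ii} = \sum_k b_{ik}a_{ki}$, and
$$\operatorname{tr}(\mathbf{B}\mathbf{A}) = \operatorname{tr}(\mathbf{F}) = \sum_i f_{ii} = \sum_i \sum_k b_{ik}a_{ki} = \sum_k \sum_i a_{ki}b_{ik} = \operatorname{tr}(\mathbf{E}) = \operatorname{tr}(\mathbf{A}\mathbf{B}).$$

(iii) By Theorem 2.2c(i), $\mathbf{A}'\mathbf{A}$ is obtained as products of columns of $\mathbf{A}$. If $\mathbf{a}_i$ is the $i$th column of $\mathbf{A}$, then the $i$th diagonal element of $\mathbf{A}'\mathbf{A}$ is $\mathbf{a}_i'\mathbf{a}_i$.

(vi) By (2.87) we obtain
$$\operatorname{tr}(\mathbf{P}^{-1}\mathbf{A}\mathbf{P}) = \operatorname{tr}(\mathbf{A}\mathbf{P}\mathbf{P}^{-1}) = \operatorname{tr}(\mathbf{A}).$$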


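Example 2.11. We illustrate parts (ii) and (viii) of Theorem 2.11.

(ii) Let
$$\mathbf{A} = \begin{pmatrix} 1 & 3 \\ 2 & -1 \\ 4 & 6 \end{pmatrix} \quad \text{and} \quad \mathbf{B} = \begin{pmatrix} 3 & -2 & 1 \\ 2 & 4 & 5 \end{pmatrix}.$$
Then
$$\mathbf{A}\mathbf{B} = \begin{pmatrix} 9 & 10 & 16 \\ 4 & -8 & -3 \\ 24 & 16 & 34 \end{pmatrix}, \qquad \mathbf{B}\mathbf{A} = \begin{pmatrix} 3 & 17 \\ 30 & 32 \end{pmatrix},$$
$$\operatorname{tr}(\mathbf{A}\mathbf{B}) = 9 - 8 + 34 = 35, \qquad \operatorname{tr}(\mathbf{B}\mathbf{A}) = 3 + 32 = 35.$$

(viii) Using $\mathbf{A}$ in (2.59) and $\mathbf{A}_1^-$ in (2.60), we obtain
$$\mathbf{A}^-\mathbf{A} = \begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & \tfrac{1}{2} \\ 0 & 0 & 0 \end{pmatrix}, \qquad \mathbf{A}\mathbf{A}^- = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 1 & 1 & 0 \end{pmatrix},$$
$$\operatorname{tr}(\mathbf{A}^-\mathbf{A}) = 1 + 1 + 0 = 2 = \operatorname{rank}(\mathbf{A}),$$
$$\operatorname{tr}(\mathbf{A}\mathbf{A}^-) = 1 + 1 + 0 = 2 = \operatorname{rank}(\mathbf{A}).$$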
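Parts (ii) and (v) of Theorem 2.11 are easy to confirm by direct computation. The sketch below uses a $3 \times 2$ matrix `A` and a $2 \times 3$ matrix `B` chosen to reproduce the products $\mathbf{A}\mathbf{B}$ and $\mathbf{B}\mathbf{A}$ printed in Example 2.11; treat the entries as a reconstruction, and the helper names as illustrative.

```python
def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(M):
    return [list(col) for col in zip(*M)]

def trace(M):
    return sum(M[i][i] for i in range(len(M)))

A = [[1, 3], [2, -1], [4, 6]]      # 3 x 2
B = [[3, -2, 1], [2, 4, 5]]        # 2 x 3

AB = matmul(A, B)                  # 3 x 3
BA = matmul(B, A)                  # 2 x 2
print(AB, BA)
print(trace(AB), trace(BA))        # part (ii): equal even though sizes differ

# Part (v): tr(A'A) = tr(AA') = sum of squared elements
sum_sq = sum(v * v for row in A for v in row)
print(trace(matmul(transpose(A), A)), trace(matmul(A, transpose(A))), sum_sq)
```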


2.12 EIGENVALUES AND EIGENVECTORS 


2.12.1 Definition 


For every square matrix $\mathbf{A}$, a scalar $\lambda$ and a nonzero vector $\mathbf{x}$ can be found such that
$$\mathbf{A}\mathbf{x} = \lambda\mathbf{x}, \tag{2.94}$$

Figure 2.5 An eigenvector $\mathbf{x}$ is transformed to $\lambda\mathbf{x}$.


where $\lambda$ is an eigenvalue of $\mathbf{A}$ and $\mathbf{x}$ is an eigenvector. (These terms are sometimes referred to as characteristic root and characteristic vector, respectively.) Note that in (2.94), the vector $\mathbf{x}$ is transformed by $\mathbf{A}$ onto a multiple of itself, so that the point $\mathbf{A}\mathbf{x}$ is on the line passing through $\mathbf{x}$ and the origin. This is illustrated in Figure 2.5.

To find $\lambda$ and $\mathbf{x}$ for a matrix $\mathbf{A}$, we write (2.94) as
$$(\mathbf{A} - \lambda\mathbf{I})\mathbf{x} = \mathbf{0}. \tag{2.95}$$
By (2.37), $(\mathbf{A} - \lambda\mathbf{I})\mathbf{x}$ is a linear combination of the columns of $\mathbf{A} - \lambda\mathbf{I}$, and by (2.40) and (2.95), these columns are linearly dependent. Thus the square matrix $(\mathbf{A} - \lambda\mathbf{I})$ is singular, and by Theorem 2.9a(iii), we can solve for $\lambda$ using
$$|\mathbf{A} - \lambda\mathbf{I}| = 0, \tag{2.96}$$
which is known as the characteristic equation.

If $\mathbf{A}$ is $n \times n$, the characteristic equation (2.96) will have $n$ roots; that is, $\mathbf{A}$ will have $n$ eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_n$. The $\lambda$'s will not necessarily all be distinct, or all nonzero, or even all real. (However, the eigenvalues of a symmetric matrix are real; see Theorem 2.12c.) After finding $\lambda_1, \lambda_2, \ldots, \lambda_n$ using (2.96), the accompanying eigenvectors $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$ can be found using (2.95).

If an eigenvalue is 0, the corresponding eigenvector is not $\mathbf{0}$. To see this, note that if $\lambda = 0$, then $(\mathbf{A} - \lambda\mathbf{I})\mathbf{x} = \mathbf{0}$ becomes $\mathbf{A}\mathbf{x} = \mathbf{0}$, which has solutions for $\mathbf{x}$ because $\mathbf{A}$ is singular, and the columns are therefore linearly dependent. [The matrix $\mathbf{A}$ is singular because it has a zero eigenvalue; see (2.63) and (2.107).]

If we multiply both sides of (2.95) by a scalar $k$, we obtain
$$k(\mathbf{A} - \lambda\mathbf{I})\mathbf{x} = k\mathbf{0} = \mathbf{0},$$
which can be rewritten as
$$(\mathbf{A} - \lambda\mathbf{I})k\mathbf{x} = \mathbf{0} \quad [\text{by (2.12)}].$$
Thus if $\mathbf{x}$ is an eigenvector of $\mathbf{A}$, $k\mathbf{x}$ is also an eigenvector. Eigenvectors are therefore unique only up to multiplication by a scalar. (There are many solution vectors $\mathbf{x}$ because $\mathbf{A} - \lambda\mathbf{I}$ is singular; see Section 2.8.) Hence, the length of $\mathbf{x}$ is arbitrary, but its direction from the origin is unique; that is, the relative values of (ratios of) the elements of $\mathbf{x} = (x_1, x_2, \ldots, x_n)'$ are unique. Typically, an eigenvector $\mathbf{x}$ is scaled to normalized form as in (2.82), $\mathbf{x}'\mathbf{x} = 1$.




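Example 2.12.1. To illustrate eigenvalues and eigenvectors, consider the matrix
$$\mathbf{A} = \begin{pmatrix} 1 & 2 \\ -1 & 4 \end{pmatrix}.$$
By (2.96), the characteristic equation is
$$|\mathbf{A} - \lambda\mathbf{I}| = \begin{vmatrix} 1 - \lambda & 2 \\ -1 & 4 - \lambda \end{vmatrix} = (1 - \lambda)(4 - \lambda) + 2 = 0,$$
which becomes
$$\lambda^2 - 5\lambda + 6 = (\lambda - 3)(\lambda - 2) = 0,$$
with roots $\lambda_1 = 3$ and $\lambda_2 = 2$.

To find the eigenvector $\mathbf{x}_1$ corresponding to $\lambda_1 = 3$, we use (2.95)
$$(\mathbf{A} - \lambda_1\mathbf{I})\mathbf{x}_1 = \mathbf{0},$$
$$\begin{pmatrix} 1 - 3 & 2 \\ -1 & 4 - 3 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix},$$
which can be written as
$$\begin{aligned} -2x_1 + 2x_2 &= 0 \\ -x_1 + x_2 &= 0. \end{aligned}$$
The second equation is a multiple of the first, and either equation yields $x_1 = x_2$. The solution vector can be written with $x_1 = x_2 = c$ as an arbitrary constant:
$$\mathbf{x}_1 = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = x_1 \begin{pmatrix} 1 \\ 1 \end{pmatrix} = c \begin{pmatrix} 1 \\ 1 \end{pmatrix}.$$
If $c$ is set equal to $1/\sqrt{2}$ to normalize the eigenvector, we obtain
$$\mathbf{x}_1 = \begin{pmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{pmatrix}.$$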




2.12.2 Functions of a Matrix 


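If $\lambda$ is an eigenvalue of $\mathbf{A}$ with corresponding eigenvector $\mathbf{x}$, then for certain functions $g(\mathbf{A})$, an eigenvalue is given by $g(\lambda)$ and $\mathbf{x}$ is the corresponding eigenvector of $g(\mathbf{A})$ as well as of $\mathbf{A}$. We illustrate some of these cases:

1. If $\lambda$ is an eigenvalue of $\mathbf{A}$, then $c\lambda$ is an eigenvalue of $c\mathbf{A}$, where $c$ is an arbitrary constant such that $c \neq 0$. This is easily demonstrated by multiplying the defining relationship $\mathbf{A}\mathbf{x} = \lambda\mathbf{x}$ by $c$:
$$c\mathbf{A}\mathbf{x} = c\lambda\mathbf{x}. \tag{2.97}$$
Note that $\mathbf{x}$ is an eigenvector of $\mathbf{A}$ corresponding to $\lambda$, and $\mathbf{x}$ is also an eigenvector of $c\mathbf{A}$ corresponding to $c\lambda$.

2. If $\lambda$ is an eigenvalue of $\mathbf{A}$ and $\mathbf{x}$ is the corresponding eigenvector of $\mathbf{A}$, then $c\lambda + k$ is an eigenvalue of the matrix $c\mathbf{A} + k\mathbf{I}$ and $\mathbf{x}$ is an eigenvector of $c\mathbf{A} + k\mathbf{I}$, where $c$ and $k$ are scalars. To show this, we add $k\mathbf{x}$ to (2.97):
$$c\mathbf{A}\mathbf{x} + k\mathbf{x} = c\lambda\mathbf{x} + k\mathbf{x},$$
$$(c\mathbf{A} + k\mathbf{I})\mathbf{x} = (c\lambda + k)\mathbf{x}. \tag{2.98}$$
Thus $c\lambda + k$ is an eigenvalue of $c\mathbf{A} + k\mathbf{I}$ and $\mathbf{x}$ is the corresponding eigenvector of $c\mathbf{A} + k\mathbf{I}$. Note that (2.98) does not extend to $\mathbf{A} + \mathbf{B}$ for arbitrary $n \times n$ matrices $\mathbf{A}$ and $\mathbf{B}$; that is, $\mathbf{A} + \mathbf{B}$ does not have $\lambda_A + \lambda_B$ for an eigenvalue, where $\lambda_A$ is an eigenvalue of $\mathbf{A}$ and $\lambda_B$ is an eigenvalue of $\mathbf{B}$.

3. If $\lambda$ is an eigenvalue of $\mathbf{A}$, then $\lambda^2$ is an eigenvalue of $\mathbf{A}^2$. This can be demonstrated by multiplying the defining relationship $\mathbf{A}\mathbf{x} = \lambda\mathbf{x}$ by $\mathbf{A}$:
$$\mathbf{A}(\mathbf{A}\mathbf{x}) = \mathbf{A}(\lambda\mathbf{x}),$$
$$\mathbf{A}^2\mathbf{x} = \lambda\mathbf{A}\mathbf{x} = \lambda(\lambda\mathbf{x}) = \lambda^2\mathbf{x}. \tag{2.99}$$
Thus $\lambda^2$ is an eigenvalue of $\mathbf{A}^2$, and $\mathbf{x}$ is the corresponding eigenvector of $\mathbf{A}^2$. This can be extended to any power of $\mathbf{A}$:
$$\mathbf{A}^k\mathbf{x} = \lambda^k\mathbf{x}; \tag{2.100}$$
that is, $\lambda^k$ is an eigenvalue of $\mathbf{A}^k$, and $\mathbf{x}$ is the corresponding eigenvector.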




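4. If $\lambda$ is an eigenvalue of the nonsingular matrix $\mathbf{A}$, then $1/\lambda$ is an eigenvalue of $\mathbf{A}^{-1}$. To demonstrate this, we multiply $\mathbf{A}\mathbf{x} = \lambda\mathbf{x}$ by $\mathbf{A}^{-1}$ to obtain
$$\mathbf{A}^{-1}\mathbf{A}\mathbf{x} = \mathbf{A}^{-1}\lambda\mathbf{x},$$
$$\mathbf{x} = \lambda\mathbf{A}^{-1}\mathbf{x},$$
$$\mathbf{A}^{-1}\mathbf{x} = \frac{1}{\lambda}\mathbf{x}. \tag{2.101}$$
Thus $1/\lambda$ is an eigenvalue of $\mathbf{A}^{-1}$, and $\mathbf{x}$ is an eigenvector of both $\mathbf{A}$ and $\mathbf{A}^{-1}$.

5. The results in (2.97) and (2.100) can be used to obtain eigenvalues and eigenvectors of a polynomial in $\mathbf{A}$. For example, if $\lambda$ is an eigenvalue of $\mathbf{A}$, then
$$\begin{aligned}
(\mathbf{A}^3 + 4\mathbf{A}^2 - 3\mathbf{A} + 5\mathbf{I})\mathbf{x} &= \mathbf{A}^3\mathbf{x} + 4\mathbf{A}^2\mathbf{x} - 3\mathbf{A}\mathbf{x} + 5\mathbf{x} \\
&= \lambda^3\mathbf{x} + 4\lambda^2\mathbf{x} - 3\lambda\mathbf{x} + 5\mathbf{x} \\
&= (\lambda^3 + 4\lambda^2 - 3\lambda + 5)\mathbf{x}.
\end{aligned}$$
Thus $\lambda^3 + 4\lambda^2 - 3\lambda + 5$ is an eigenvalue of $\mathbf{A}^3 + 4\mathbf{A}^2 - 3\mathbf{A} + 5\mathbf{I}$, and $\mathbf{x}$ is the corresponding eigenvector.

For certain matrices, property 5 can be extended to an infinite series. For example, if $\lambda$ is an eigenvalue of $\mathbf{A}$, then, by (2.98), $1 - \lambda$ is an eigenvalue of $\mathbf{I} - \mathbf{A}$. If $\mathbf{I} - \mathbf{A}$ is nonsingular, then, by (2.101), $1/(1 - \lambda)$ is an eigenvalue of $(\mathbf{I} - \mathbf{A})^{-1}$. If $-1 < \lambda < 1$, then $1/(1 - \lambda)$ can be represented by the series
$$\frac{1}{1 - \lambda} = 1 + \lambda + \lambda^2 + \lambda^3 + \cdots.$$
Correspondingly, if all eigenvalues of $\mathbf{A}$ satisfy $-1 < \lambda < 1$, then
$$(\mathbf{I} - \mathbf{A})^{-1} = \mathbf{I} + \mathbf{A} + \mathbf{A}^2 + \mathbf{A}^3 + \cdots. \tag{2.102}$$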
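The series in (2.102) can be illustrated by accumulating partial sums. The sketch below uses an arbitrary $2 \times 2$ matrix (an assumption, not from the text) whose eigenvalues, roughly 0.57 and 0.23, lie between $-1$ and 1, and checks that $(\mathbf{I} - \mathbf{A})$ times the partial sum is close to $\mathbf{I}$.

```python
def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

A = [[0.5, 0.2], [0.1, 0.3]]     # illustrative matrix with eigenvalues in (-1, 1)
I = [[1.0, 0.0], [0.0, 1.0]]

series = [row[:] for row in I]   # running partial sum I + A + ... + A^k
power = [row[:] for row in I]    # current power A^k
for _ in range(200):
    power = matmul(power, A)
    series = add(series, power)

I_minus_A = [[I[i][j] - A[i][j] for j in range(2)] for i in range(2)]
check = matmul(I_minus_A, series)   # should be close to the identity
print(series)
print(check)
```

For this matrix the exact inverse is $(1/0.33)\begin{pmatrix} 0.7 & 0.2 \\ 0.1 & 0.5 \end{pmatrix}$, which the partial sum approaches as more terms are added.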


2.12.3 Products 


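It was noted in a comment following (2.98) that the eigenvalues of $\mathbf{A} + \mathbf{B}$ are not of the form $\lambda_A + \lambda_B$, where $\lambda_A$ is an eigenvalue of $\mathbf{A}$ and $\lambda_B$ is an eigenvalue of $\mathbf{B}$. Similarly, the eigenvalues of $\mathbf{A}\mathbf{B}$ are not products of the form $\lambda_A\lambda_B$. However, the eigenvalues of $\mathbf{A}\mathbf{B}$ are the same as those of $\mathbf{B}\mathbf{A}$.

Theorem 2.12a. If $\mathbf{A}$ and $\mathbf{B}$ are $n \times n$ or if $\mathbf{A}$ is $n \times p$ and $\mathbf{B}$ is $p \times n$, then the (nonzero) eigenvalues of $\mathbf{A}\mathbf{B}$ are the same as those of $\mathbf{B}\mathbf{A}$. If $\mathbf{x}$ is an eigenvector of $\mathbf{A}\mathbf{B}$, then $\mathbf{B}\mathbf{x}$ is an eigenvector of $\mathbf{B}\mathbf{A}$.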




Two additional results involving eigenvalues of products are given in the follow- 
ing theorem. 


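Theorem 2.12b. Let $\mathbf{A}$ be any $n \times n$ matrix.

(i) If $\mathbf{P}$ is any $n \times n$ nonsingular matrix, then $\mathbf{A}$ and $\mathbf{P}^{-1}\mathbf{A}\mathbf{P}$ have the same eigenvalues.
(ii) If $\mathbf{C}$ is any $n \times n$ orthogonal matrix, then $\mathbf{A}$ and $\mathbf{C}'\mathbf{A}\mathbf{C}$ have the same eigenvalues.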


2.12.4 Symmetric Matrices 


Two properties of the eigenvalues and eigenvectors of a symmetric matrix are given 
in the following theorem. 


Theorem 2.12c. Let $\mathbf{A}$ be an $n \times n$ symmetric matrix.

(i) The eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_n$ of $\mathbf{A}$ are real.
(ii) The eigenvectors $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_k$ of $\mathbf{A}$ corresponding to distinct eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_k$ are mutually orthogonal; the eigenvectors $\mathbf{x}_{k+1}, \mathbf{x}_{k+2}, \ldots, \mathbf{x}_n$ corresponding to the nondistinct eigenvalues can be chosen to be mutually orthogonal to each other and to the other eigenvectors; that is, $\mathbf{x}_i'\mathbf{x}_j = 0$ for $i \neq j$.


If the eigenvectors of a symmetric matrix A are normalized and placed as columns 
of a matrix C, then by Theorem 2.12c(ii), C is an orthogonal matrix. This orthogonal 
matrix can be used to express A in terms of its eigenvalues and eigenvectors. 


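Theorem 2.12d. If $\mathbf{A}$ is an $n \times n$ symmetric matrix with eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_n$ and normalized eigenvectors $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$, then $\mathbf{A}$ can be expressed as
$$\mathbf{A} = \mathbf{C}\mathbf{D}\mathbf{C}' \tag{2.103}$$
$$\phantom{\mathbf{A}} = \sum_{i=1}^{n} \lambda_i\mathbf{x}_i\mathbf{x}_i', \tag{2.104}$$
where $\mathbf{D} = \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$ and $\mathbf{C}$ is the orthogonal matrix $\mathbf{C} = (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n)$. The result in either (2.103) or (2.104) is often called the spectral decomposition of $\mathbf{A}$.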


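PROOF. By Theorem 2.12c(ii), $\mathbf{C}$ is orthogonal. Then by (2.84), $\mathbf{I} = \mathbf{C}\mathbf{C}'$, and multiplication by $\mathbf{A}$ gives
$$\mathbf{A} = \mathbf{A}\mathbf{C}\mathbf{C}'.$$
We now substitute $\mathbf{C} = (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n)$ to obtain
$$\begin{aligned}
\mathbf{A} &= \mathbf{A}(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n)\mathbf{C}' \\
&= (\mathbf{A}\mathbf{x}_1, \mathbf{A}\mathbf{x}_2, \ldots, \mathbf{A}\mathbf{x}_n)\mathbf{C}' && [\text{by (2.28)}] \\
&= (\lambda_1\mathbf{x}_1, \lambda_2\mathbf{x}_2, \ldots, \lambda_n\mathbf{x}_n)\mathbf{C}' && [\text{by (2.94)}] \\
&= \mathbf{C}\mathbf{D}\mathbf{C}',
\end{aligned} \tag{2.105}$$
since multiplication on the right by $\mathbf{D} = \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$ multiplies columns of $\mathbf{C}$ by elements of $\mathbf{D}$ [see (2.30)]. Now writing $\mathbf{C}'$ in the form
$$\mathbf{C}' = (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n)' = \begin{pmatrix} \mathbf{x}_1' \\ \mathbf{x}_2' \\ \vdots \\ \mathbf{x}_n' \end{pmatrix} \quad [\text{by (2.39)}],$$
(2.105) becomes
$$\mathbf{A} = (\lambda_1\mathbf{x}_1, \lambda_2\mathbf{x}_2, \ldots, \lambda_n\mathbf{x}_n) \begin{pmatrix} \mathbf{x}_1' \\ \mathbf{x}_2' \\ \vdots \\ \mathbf{x}_n' \end{pmatrix} = \lambda_1\mathbf{x}_1\mathbf{x}_1' + \lambda_2\mathbf{x}_2\mathbf{x}_2' + \cdots + \lambda_n\mathbf{x}_n\mathbf{x}_n'.$$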


Corollary 1. If A is symmetric and C and D are defined as in Theorem 2.12d, then C 
diagonalizes A: 


C’AC =D. (2.106) 
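The spectral decomposition (2.103) and (2.104) and the diagonalization (2.106) can be checked on a small symmetric matrix. The sketch below uses an illustrative $2 \times 2$ matrix that does not appear in the text, and a closed-form eigenvalue formula that is valid only for symmetric $2 \times 2$ matrices with a nonzero off-diagonal element; all helper names are assumptions.

```python
import math

A = [[2.0, 1.0], [1.0, 2.0]]          # illustrative symmetric matrix
a, b, d = A[0][0], A[0][1], A[1][1]

# Roots of the characteristic equation of a symmetric 2 x 2 matrix
mid = (a + d) / 2
half_gap = math.sqrt(((a - d) / 2) ** 2 + b ** 2)
eigenvalues = [mid + half_gap, mid - half_gap]

def unit_eigenvector(lam):
    # (A - lam I)x = 0 gives x proportional to (b, lam - a) when b != 0
    v = (b, lam - a)
    norm = math.hypot(v[0], v[1])
    return [v[0] / norm, v[1] / norm]

vectors = [unit_eigenvector(lam) for lam in eigenvalues]

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

C = transpose(vectors)                                  # eigenvectors as columns
D = [[eigenvalues[0], 0.0], [0.0, eigenvalues[1]]]

CDCt = matmul(matmul(C, D), transpose(C))               # (2.103)
spectral_sum = [[sum(lam * x[i] * x[j] for lam, x in zip(eigenvalues, vectors))
                 for j in range(2)] for i in range(2)]  # (2.104)
CtAC = matmul(transpose(C), matmul(A, C))               # (2.106)
print(eigenvalues, CDCt, spectral_sum, CtAC)
```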


We can express the determinant and trace of a square matrix A in terms of its 
eigenvalues. 


Theorem 2.12e. If $\mathbf{A}$ is any $n \times n$ matrix with eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_n$, then

(i)
$$|\mathbf{A}| = \prod_{i=1}^{n} \lambda_i. \tag{2.107}$$
(ii)
$$\operatorname{tr}(\mathbf{A}) = \sum_{i=1}^{n} \lambda_i. \tag{2.108}$$

We have included Theorem 2.12e here because it is easy to prove for a symmetric matrix $\mathbf{A}$ using Theorem 2.12d (see Problem 2.72). However, the theorem is true for any square matrix (Searle 1982, p. 278).




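Example 2.12.4. To illustrate Theorem 2.12e, consider the matrix $\mathbf{A}$ in Example 2.12.1
$$\mathbf{A} = \begin{pmatrix} 1 & 2 \\ -1 & 4 \end{pmatrix},$$
which has eigenvalues $\lambda_1 = 3$ and $\lambda_2 = 2$. The product $\lambda_1\lambda_2 = 6$ is the same as $|\mathbf{A}| = 4 - (-1)(2) = 6$. The sum $\lambda_1 + \lambda_2 = 3 + 2 = 5$ is the same as $\operatorname{tr}(\mathbf{A}) = 1 + 4 = 5$.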
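The calculations in Examples 2.12.1 and 2.12.4 can be reproduced in a few lines. The sketch below relies on the fact that, for a $2 \times 2$ matrix, the characteristic equation (2.96) reduces to $\lambda^2 - \operatorname{tr}(\mathbf{A})\lambda + |\mathbf{A}| = 0$; the helper names are illustrative, and the eigenvector formula assumes the first row of $\mathbf{A} - \lambda\mathbf{I}$ is not zero.

```python
import math

A = [[1.0, 2.0], [-1.0, 4.0]]
trace = A[0][0] + A[1][1]
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]

# For a 2 x 2 matrix, |A - lam I| = lam^2 - tr(A) lam + |A|
root = math.sqrt(trace ** 2 - 4 * det)
lam1, lam2 = (trace + root) / 2, (trace - root) / 2

def unit_eigenvector(lam):
    # First row of (A - lam I)x = 0: (a11 - lam) x1 + a12 x2 = 0
    v = (A[0][1], lam - A[0][0])
    norm = math.hypot(v[0], v[1])
    return [v[0] / norm, v[1] / norm]

def times_A(x):
    return [A[0][0] * x[0] + A[0][1] * x[1], A[1][0] * x[0] + A[1][1] * x[1]]

x1, x2 = unit_eigenvector(lam1), unit_eigenvector(lam2)
print(lam1, lam2)                 # eigenvalues
print(lam1 * lam2, det)           # product equals |A|, as in (2.107)
print(lam1 + lam2, trace)         # sum equals tr(A), as in (2.108)
print(x1, times_A(x1))            # A x1 is lam1 times x1
print(x2, times_A(x2))            # A x2 is lam2 times x2
```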


2.12.5 Positive Definite and Semidefinite Matrices 


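The eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_n$ of positive definite and positive semidefinite matrices (Section 2.6) are positive and nonnegative, respectively.

Theorem 2.12f. Let $\mathbf{A}$ be $n \times n$ with eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_n$.

(i) If $\mathbf{A}$ is positive definite, then $\lambda_i > 0$ for $i = 1, 2, \ldots, n$.
(ii) If $\mathbf{A}$ is positive semidefinite, then $\lambda_i \ge 0$ for $i = 1, 2, \ldots, n$. The number of eigenvalues $\lambda_i$ for which $\lambda_i > 0$ is the rank of $\mathbf{A}$.

PROOF.

(i) For any $\lambda_i$, we have $\mathbf{A}\mathbf{x}_i = \lambda_i\mathbf{x}_i$. Multiplying by $\mathbf{x}_i'$, we obtain
$$\mathbf{x}_i'\mathbf{A}\mathbf{x}_i = \lambda_i\mathbf{x}_i'\mathbf{x}_i,$$
$$\lambda_i = \frac{\mathbf{x}_i'\mathbf{A}\mathbf{x}_i}{\mathbf{x}_i'\mathbf{x}_i} > 0.$$
In the second expression, $\mathbf{x}_i'\mathbf{A}\mathbf{x}_i$ is positive because $\mathbf{A}$ is positive definite, and $\mathbf{x}_i'\mathbf{x}_i$ is positive because $\mathbf{x}_i \neq \mathbf{0}$.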


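If a matrix $\mathbf{A}$ is positive definite, we can find a square root matrix $\mathbf{A}^{1/2}$ as follows. Since the eigenvalues of $\mathbf{A}$ are positive, we can substitute the square roots $\sqrt{\lambda_i}$ for $\lambda_i$ in the spectral decomposition of $\mathbf{A}$ in (2.103), to obtain
$$\mathbf{A}^{1/2} = \mathbf{C}\mathbf{D}^{1/2}\mathbf{C}', \tag{2.109}$$
where $\mathbf{D}^{1/2} = \operatorname{diag}(\sqrt{\lambda_1}, \sqrt{\lambda_2}, \ldots, \sqrt{\lambda_n})$. The matrix $\mathbf{A}^{1/2}$ is symmetric and has the property
$$\mathbf{A}^{1/2}\mathbf{A}^{1/2} = (\mathbf{A}^{1/2})^2 = \mathbf{A}. \tag{2.110}$$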
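The construction in (2.109) can be tried on a small positive definite matrix. The sketch below uses an illustrative $2 \times 2$ matrix (not from the text) whose eigenvalues 3 and 1 and normalized eigenvectors are written in by hand; these values follow from its characteristic equation $(2 - \lambda)^2 - 1 = 0$.

```python
import math

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

A = [[2.0, 1.0], [1.0, 2.0]]            # illustrative positive definite matrix
eigenvalues = [3.0, 1.0]                # roots of (2 - lam)^2 - 1 = 0
s = 1 / math.sqrt(2)
C = [[s, s], [s, -s]]                   # normalized eigenvectors as columns

# D^{1/2} holds the square roots of the eigenvalues on its diagonal
D_half = [[math.sqrt(eigenvalues[0]), 0.0], [0.0, math.sqrt(eigenvalues[1])]]
A_half = matmul(matmul(C, D_half), transpose(C))    # (2.109)

print(A_half)
print(matmul(A_half, A_half))           # should reproduce A, as in (2.110)
```

The resulting square root matrix is symmetric, in agreement with the remark following (2.109).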


2.13 IDEMPOTENT MATRICES 


A square matrix $\mathbf{A}$ is said to be idempotent if $\mathbf{A}^2 = \mathbf{A}$. Most idempotent matrices in this book are symmetric. Many of the sums of squares in regression (Chapters 6-11) and analysis of variance (Chapters 12-15) can be expressed as quadratic forms $\mathbf{y}'\mathbf{A}\mathbf{y}$. The idempotence of $\mathbf{A}$ or of a product involving $\mathbf{A}$ will be used to establish that $\mathbf{y}'\mathbf{A}\mathbf{y}$ (or a multiple of $\mathbf{y}'\mathbf{A}\mathbf{y}$) has a chi-square distribution.

An example of an idempotent matrix is the identity matrix I. 


Theorem 2.13a. The only nonsingular idempotent matrix is the identity matrix I. 


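PROOF. If $\mathbf{A}$ is idempotent and nonsingular, then $\mathbf{A}^2 = \mathbf{A}$ and the inverse $\mathbf{A}^{-1}$ exists. If we multiply $\mathbf{A}^2 = \mathbf{A}$ by $\mathbf{A}^{-1}$, we obtain
$$\mathbf{A}^{-1}\mathbf{A}^2 = \mathbf{A}^{-1}\mathbf{A},$$
$$\mathbf{A} = \mathbf{I}.$$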


Many of the matrices of quadratic forms we will encounter in later chapters are 
singular idempotent matrices. We now give some properties of such matrices. 


Theorem 2.13b. If A is singular, symmetric, and idempotent, then A is positive 
semidefinite. 


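PROOF. Since $\mathbf{A} = \mathbf{A}'$ and $\mathbf{A} = \mathbf{A}^2$, we have
$$\mathbf{A} = \mathbf{A}^2 = \mathbf{A}\mathbf{A} = \mathbf{A}'\mathbf{A},$$
which is positive semidefinite by Theorem 2.6d(ii).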


If $a$ is a real number such that $a^2 = a$, then $a$ is either 0 or 1. The analogous property for matrices is that if $\mathbf{A}^2 = \mathbf{A}$, then the eigenvalues of $\mathbf{A}$ are 0s and 1s.


Theorem 2.13c. If A is ann x n symmetric idempotent matrix of rank r, then A has 
r eigenvalues equal to 1 and n — r eigenvalues equal to 0. 


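PROOF. By (2.99), if $\mathbf{A}\mathbf{x} = \lambda\mathbf{x}$, then $\mathbf{A}^2\mathbf{x} = \lambda^2\mathbf{x}$. Since $\mathbf{A}^2 = \mathbf{A}$, we have $\mathbf{A}^2\mathbf{x} = \mathbf{A}\mathbf{x} = \lambda\mathbf{x}$. Equating the right sides of $\mathbf{A}^2\mathbf{x} = \lambda^2\mathbf{x}$ and $\mathbf{A}^2\mathbf{x} = \lambda\mathbf{x}$, we have
$$\lambda\mathbf{x} = \lambda^2\mathbf{x} \quad \text{or} \quad (\lambda - \lambda^2)\mathbf{x} = \mathbf{0}.$$
But $\mathbf{x} \neq \mathbf{0}$, and therefore $\lambda - \lambda^2 = 0$, from which $\lambda$ is either 0 or 1.

By Theorem 2.13b, $\mathbf{A}$ is positive semidefinite, and therefore by Theorem 2.12f(ii), the number of nonzero eigenvalues is equal to rank$(\mathbf{A})$. Thus $r$ eigenvalues of $\mathbf{A}$ are equal to 1 and the remaining $n - r$ eigenvalues are equal to 0.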




We can use Theorems 2.12e and 2.13c to find the rank of a symmetric idempotent 
matrix. 


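Theorem 2.13d. If $\mathbf{A}$ is symmetric and idempotent of rank $r$, then rank$(\mathbf{A})$ = tr$(\mathbf{A})$ = $r$.

PROOF. By Theorem 2.12e(ii), $\operatorname{tr}(\mathbf{A}) = \sum_{i=1}^{n} \lambda_i$, and by Theorem 2.13c, $\sum_{i=1}^{n} \lambda_i = r$.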
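Theorem 2.13d can be illustrated with the centering matrix $\mathbf{I} - (1/n)\mathbf{J}$, where $\mathbf{J}$ is the $n \times n$ matrix of 1s. This is a standard symmetric idempotent matrix chosen here as an assumption for illustration; it is not introduced in this section. The sketch uses exact fractions and an illustrative `rank` helper based on elimination.

```python
from fractions import Fraction

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def rank(rows):
    """Rank by Gauss-Jordan elimination in exact arithmetic."""
    m = [[Fraction(v) for v in row] for row in rows]
    r = 0
    for col in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(len(m)):
            if i != r and m[i][col] != 0:
                factor = m[i][col] / m[r][col]
                m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
        if r == len(m):
            break
    return r

n = 4
# Centering matrix I - (1/n)J: symmetric and idempotent
A = [[Fraction(int(i == j)) - Fraction(1, n) for j in range(n)] for i in range(n)]

is_symmetric = all(A[i][j] == A[j][i] for i in range(n) for j in range(n))
is_idempotent = matmul(A, A) == A
trace = sum(A[i][i] for i in range(n))
print(is_symmetric, is_idempotent, trace, rank(A))
```

Here both the trace and the rank equal $n - 1$, so the rank of this matrix can be read off from its diagonal without any elimination.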


Some additional properties of idempotent matrices are given in the following four 
theorems. 


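Theorem 2.13e. If $\mathbf{A}$ is an $n \times n$ idempotent matrix, $\mathbf{P}$ is an $n \times n$ nonsingular matrix, and $\mathbf{C}$ is an $n \times n$ orthogonal matrix, then

(i) $\mathbf{I} - \mathbf{A}$ is idempotent.
(ii) $\mathbf{A}(\mathbf{I} - \mathbf{A}) = \mathbf{O}$ and $(\mathbf{I} - \mathbf{A})\mathbf{A} = \mathbf{O}$.
(iii) $\mathbf{P}^{-1}\mathbf{A}\mathbf{P}$ is idempotent.
(iv) $\mathbf{C}'\mathbf{A}\mathbf{C}$ is idempotent. (If $\mathbf{A}$ is symmetric, $\mathbf{C}'\mathbf{A}\mathbf{C}$ is a symmetric idempotent matrix.)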


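Theorem 2.13f. Let $\mathbf{A}$ be $n \times p$ of rank $r$, let $\mathbf{A}^-$ be any generalized inverse of $\mathbf{A}$, and let $(\mathbf{A}'\mathbf{A})^-$ be any generalized inverse of $\mathbf{A}'\mathbf{A}$. Then $\mathbf{A}^-\mathbf{A}$, $\mathbf{A}\mathbf{A}^-$, and $\mathbf{A}(\mathbf{A}'\mathbf{A})^-\mathbf{A}'$ are all idempotent.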


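Theorem 2.13g. Suppose that the $n \times n$ symmetric matrix $\mathbf{A}$ can be written as
$\mathbf{A} = \sum_{i=1}^k \mathbf{A}_i$ for some $k$, where each $\mathbf{A}_i$ is an $n \times n$ symmetric matrix. Then any
two of the following conditions imply the third condition.


(i) $\mathbf{A}$ is idempotent.
(ii) Each of $\mathbf{A}_1, \mathbf{A}_2, \ldots, \mathbf{A}_k$ is idempotent.
(iii) $\mathbf{A}_i\mathbf{A}_j = \mathbf{O}$ for $i \neq j$.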


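Theorem 2.13h. If $\mathbf{I} = \sum_{i=1}^k \mathbf{A}_i$, where each $n \times n$ matrix $\mathbf{A}_i$ is symmetric of rank $r_i$,
and if $n = \sum_{i=1}^k r_i$, then both of the following are true:


(i) Each of $\mathbf{A}_1, \mathbf{A}_2, \ldots, \mathbf{A}_k$ is idempotent.
(ii) $\mathbf{A}_i\mathbf{A}_j = \mathbf{O}$ for $i \neq j$.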


2.14 VECTOR AND MATRIX CALCULUS 


2.14.1 Derivatives of Functions of Vectors and Matrices 


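Let $u = f(\mathbf{x})$ be a function of the variables $x_1, x_2, \ldots, x_p$ in $\mathbf{x} = (x_1, x_2, \ldots, x_p)'$, and
let $\partial u/\partial x_1, \partial u/\partial x_2, \ldots, \partial u/\partial x_p$ be the partial derivatives. We define $\partial u/\partial\mathbf{x}$ as

$$\frac{\partial u}{\partial \mathbf{x}} = \begin{pmatrix} \dfrac{\partial u}{\partial x_1} \\ \dfrac{\partial u}{\partial x_2} \\ \vdots \\ \dfrac{\partial u}{\partial x_p} \end{pmatrix}. \tag{2.111}$$


Two specific functions of interest are $u = \mathbf{a}'\mathbf{x}$ and $u = \mathbf{x}'\mathbf{A}\mathbf{x}$. Their derivatives with
respect to $\mathbf{x}$ are given in the following two theorems.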


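Theorem 2.14a. Let $u = \mathbf{a}'\mathbf{x} = \mathbf{x}'\mathbf{a}$, where $\mathbf{a}' = (a_1, a_2, \ldots, a_p)$ is a vector of constants. Then

$$\frac{\partial u}{\partial \mathbf{x}} = \frac{\partial(\mathbf{a}'\mathbf{x})}{\partial \mathbf{x}} = \frac{\partial(\mathbf{x}'\mathbf{a})}{\partial \mathbf{x}} = \mathbf{a}. \tag{2.112}$$


PROOF.

$$\frac{\partial u}{\partial x_i} = \frac{\partial(a_1x_1 + a_2x_2 + \cdots + a_px_p)}{\partial x_i} = a_i.$$

Thus by (2.111) we obtain

$$\frac{\partial u}{\partial \mathbf{x}} = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_p \end{pmatrix} = \mathbf{a}.$$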

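Theorem 2.14b. Let $u = \mathbf{x}'\mathbf{A}\mathbf{x}$, where $\mathbf{A}$ is a symmetric matrix of constants. Then

$$\frac{\partial u}{\partial \mathbf{x}} = \frac{\partial(\mathbf{x}'\mathbf{A}\mathbf{x})}{\partial \mathbf{x}} = 2\mathbf{A}\mathbf{x}. \tag{2.113}$$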




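PROOF. We demonstrate that (2.113) holds for the special case in which $\mathbf{A}$ is $3 \times 3$.
The illustration could be generalized to a symmetric $\mathbf{A}$ of any size. Let

$$\mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} \quad\text{and}\quad \mathbf{A} = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{12} & a_{22} & a_{23} \\ a_{13} & a_{23} & a_{33} \end{pmatrix} = \begin{pmatrix} \mathbf{a}_1' \\ \mathbf{a}_2' \\ \mathbf{a}_3' \end{pmatrix}.$$

Then $\mathbf{x}'\mathbf{A}\mathbf{x} = x_1^2a_{11} + 2x_1x_2a_{12} + 2x_1x_3a_{13} + x_2^2a_{22} + 2x_2x_3a_{23} + x_3^2a_{33}$, and we have

$$\frac{\partial(\mathbf{x}'\mathbf{A}\mathbf{x})}{\partial x_1} = 2x_1a_{11} + 2x_2a_{12} + 2x_3a_{13} = 2\mathbf{a}_1'\mathbf{x},$$
$$\frac{\partial(\mathbf{x}'\mathbf{A}\mathbf{x})}{\partial x_2} = 2x_1a_{12} + 2x_2a_{22} + 2x_3a_{23} = 2\mathbf{a}_2'\mathbf{x},$$
$$\frac{\partial(\mathbf{x}'\mathbf{A}\mathbf{x})}{\partial x_3} = 2x_1a_{13} + 2x_2a_{23} + 2x_3a_{33} = 2\mathbf{a}_3'\mathbf{x}.$$

Thus by (2.11), (2.27), and (2.111), we obtain

$$\frac{\partial(\mathbf{x}'\mathbf{A}\mathbf{x})}{\partial \mathbf{x}} = \begin{pmatrix} \partial(\mathbf{x}'\mathbf{A}\mathbf{x})/\partial x_1 \\ \partial(\mathbf{x}'\mathbf{A}\mathbf{x})/\partial x_2 \\ \partial(\mathbf{x}'\mathbf{A}\mathbf{x})/\partial x_3 \end{pmatrix} = 2\begin{pmatrix} \mathbf{a}_1'\mathbf{x} \\ \mathbf{a}_2'\mathbf{x} \\ \mathbf{a}_3'\mathbf{x} \end{pmatrix} = 2\mathbf{A}\mathbf{x}.$$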
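
The result in (2.113) can also be checked numerically. The sketch below compares a central-difference approximation of $\partial(\mathbf{x}'\mathbf{A}\mathbf{x})/\partial\mathbf{x}$ with $2\mathbf{A}\mathbf{x}$ for one symmetric matrix. The matrix, the vector, the step size `h`, and the function names are all assumptions made for illustration; they do not come from the text.

```python
def quad_form(A, x):
    """Return x'Ax for a square matrix A (list of rows) and a vector x."""
    p = len(x)
    return sum(x[i] * A[i][j] * x[j] for i in range(p) for j in range(p))

def numerical_gradient(f, x, h=1e-6):
    """Central-difference approximation to the vector of partial derivatives."""
    grad = []
    for i in range(len(x)):
        up = x[:i] + [x[i] + h] + x[i + 1:]
        down = x[:i] + [x[i] - h] + x[i + 1:]
        grad.append((f(up) - f(down)) / (2 * h))
    return grad

# Illustrative symmetric matrix and evaluation point.
A = [[2.0, 1.0, 0.0],
     [1.0, 3.0, -1.0],
     [0.0, -1.0, 4.0]]
x = [1.0, -2.0, 0.5]

numeric = numerical_gradient(lambda v: quad_form(A, v), x)
analytic = [2 * sum(A[i][j] * x[j] for j in range(3)) for i in range(3)]  # 2Ax
print(numeric, analytic)
```

If $\mathbf{A}$ were not symmetric, the same numerical gradient would instead agree with $(\mathbf{A} + \mathbf{A}')\mathbf{x}$, which is why the theorem assumes symmetry.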
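Now let $u = f(\mathbf{X})$ be a function of the variables $x_{11}, x_{12}, \ldots, x_{pp}$ in the $p \times p$
matrix $\mathbf{X}$, and let $\partial u/\partial x_{11}, \partial u/\partial x_{12}, \ldots, \partial u/\partial x_{pp}$ be the partial derivatives.
Similarly to (2.111), we define $\partial u/\partial\mathbf{X}$ as

$$\frac{\partial u}{\partial \mathbf{X}} = \begin{pmatrix} \dfrac{\partial u}{\partial x_{11}} & \cdots & \dfrac{\partial u}{\partial x_{1p}} \\ \vdots & & \vdots \\ \dfrac{\partial u}{\partial x_{p1}} & \cdots & \dfrac{\partial u}{\partial x_{pp}} \end{pmatrix}. \tag{2.114}$$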


Two functions of interest of this type are u = tr(XA) and u = In|X| for a positive 
definite matrix X. 


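Theorem 2.14c. Let $u = \mathrm{tr}(\mathbf{X}\mathbf{A})$, where $\mathbf{X}$ is a $p \times p$ positive definite matrix and $\mathbf{A}$ is
a $p \times p$ matrix of constants. Then

$$\frac{\partial u}{\partial \mathbf{X}} = \frac{\partial[\mathrm{tr}(\mathbf{X}\mathbf{A})]}{\partial \mathbf{X}} = \mathbf{A} + \mathbf{A}' - \mathrm{diag}\,\mathbf{A}. \tag{2.115}$$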




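PROOF. Note that $\mathrm{tr}(\mathbf{X}\mathbf{A}) = \sum_{i=1}^p \sum_{j=1}^p x_{ij}a_{ji}$ [see the proof of Theorem 2.11(ii)].
Since $x_{ij} = x_{ji}$, $\partial[\mathrm{tr}(\mathbf{X}\mathbf{A})]/\partial x_{ij} = a_{ji} + a_{ij}$ if $i \neq j$, and $\partial[\mathrm{tr}(\mathbf{X}\mathbf{A})]/\partial x_{ii} = a_{ii}$. The
result follows.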


Theorem 2.14d. Let $u = \ln|\mathbf{X}|$ where $\mathbf{X}$ is a $p \times p$ positive definite matrix. Then

$$\frac{\partial \ln|\mathbf{X}|}{\partial \mathbf{X}} = 2\mathbf{X}^{-1} - \mathrm{diag}(\mathbf{X}^{-1}). \tag{2.116}$$


PROOF. See Harville (1997, p. 306). See Problem 2.83 for a demonstration that this
theorem holds for $2 \times 2$ matrices.


2.14.2 Derivatives Involving Inverse Matrices and Determinants 


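Let $\mathbf{A}$ be an $n \times n$ nonsingular matrix with elements $a_{ij}$ that are functions of a scalar $x$.
We define $\partial\mathbf{A}/\partial x$ as the $n \times n$ matrix with elements $\partial a_{ij}/\partial x$. The related
derivative $\partial\mathbf{A}^{-1}/\partial x$ is often of interest. If $\mathbf{A}$ is positive definite, the derivative
$(\partial/\partial x)\log|\mathbf{A}|$ is also often of interest.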


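Theorem 2.14e. Let $\mathbf{A}$ be nonsingular of order $n$ with derivative $\partial\mathbf{A}/\partial x$. Then

$$\frac{\partial \mathbf{A}^{-1}}{\partial x} = -\mathbf{A}^{-1}\frac{\partial \mathbf{A}}{\partial x}\mathbf{A}^{-1}. \tag{2.117}$$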


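PROOF. Because $\mathbf{A}$ is nonsingular, we have

$$\mathbf{A}^{-1}\mathbf{A} = \mathbf{I}.$$

Thus

$$\frac{\partial \mathbf{A}^{-1}}{\partial x}\mathbf{A} + \mathbf{A}^{-1}\frac{\partial \mathbf{A}}{\partial x} = \mathbf{O}.$$

Hence

$$\frac{\partial \mathbf{A}^{-1}}{\partial x}\mathbf{A} = -\mathbf{A}^{-1}\frac{\partial \mathbf{A}}{\partial x},$$

and so

$$\frac{\partial \mathbf{A}^{-1}}{\partial x} = -\mathbf{A}^{-1}\frac{\partial \mathbf{A}}{\partial x}\mathbf{A}^{-1}.$$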
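
A quick numerical sanity check of (2.117) is sketched below for a $2 \times 2$ matrix whose elements depend on a scalar $x$. The matrix function `A_of`, its derivative `dA_of`, the evaluation point, and the step size are hypothetical choices for illustration only.

```python
def inv2(M):
    """Inverse of a 2 x 2 matrix given as a list of rows."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def A_of(x):
    """Hypothetical matrix function A(x) = [[1 + x^2, x], [x, 2]]."""
    return [[1 + x * x, x], [x, 2.0]]

def dA_of(x):
    """Elementwise derivative of A(x) with respect to x."""
    return [[2 * x, 1.0], [1.0, 0.0]]

x, h = 0.7, 1e-6
Ainv = inv2(A_of(x))
# Right side of (2.117): -A^{-1} (dA/dx) A^{-1}
analytic = [[-v for v in row] for row in matmul(matmul(Ainv, dA_of(x)), Ainv)]
# Left side: central difference of the inverse
up, down = inv2(A_of(x + h)), inv2(A_of(x - h))
numeric = [[(up[i][j] - down[i][j]) / (2 * h) for j in range(2)] for i in range(2)]
max_err = max(abs(numeric[i][j] - analytic[i][j]) for i in range(2) for j in range(2))
print(max_err)
```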


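Theorem 2.14f. Let $\mathbf{A}$ be an $n \times n$ positive definite matrix. Then

$$\frac{\partial \log|\mathbf{A}|}{\partial x} = \mathrm{tr}\left(\mathbf{A}^{-1}\frac{\partial \mathbf{A}}{\partial x}\right). \tag{2.118}$$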




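PROOF. Since $\mathbf{A}$ is positive definite, its spectral decomposition (Theorem 2.12d) can
be written as $\mathbf{C}\mathbf{D}\mathbf{C}'$, where $\mathbf{C}$ is an orthogonal matrix and $\mathbf{D}$ is a diagonal matrix of
positive eigenvalues, $\lambda_i$. Using Theorem 2.12e, we obtain

$$\frac{\partial \log|\mathbf{A}|}{\partial x} = \frac{\partial \log\prod_{i=1}^n \lambda_i}{\partial x} = \frac{\partial \sum_{i=1}^n \log\lambda_i}{\partial x} = \sum_{i=1}^n \frac{1}{\lambda_i}\frac{\partial \lambda_i}{\partial x} = \mathrm{tr}\left(\mathbf{D}^{-1}\frac{\partial \mathbf{D}}{\partial x}\right).$$

Now

$$\mathbf{A}^{-1}\frac{\partial \mathbf{A}}{\partial x} = \mathbf{C}\mathbf{D}^{-1}\mathbf{C}'\frac{\partial(\mathbf{C}\mathbf{D}\mathbf{C}')}{\partial x} = \mathbf{C}\mathbf{D}^{-1}\mathbf{C}'\left[\mathbf{C}\frac{\partial(\mathbf{D}\mathbf{C}')}{\partial x} + \frac{\partial \mathbf{C}}{\partial x}\mathbf{D}\mathbf{C}'\right]$$
$$= \mathbf{C}\mathbf{D}^{-1}\mathbf{C}'\left[\mathbf{C}\frac{\partial \mathbf{D}}{\partial x}\mathbf{C}' + \mathbf{C}\mathbf{D}\frac{\partial \mathbf{C}'}{\partial x} + \frac{\partial \mathbf{C}}{\partial x}\mathbf{D}\mathbf{C}'\right]$$
$$= \mathbf{C}\mathbf{D}^{-1}\frac{\partial \mathbf{D}}{\partial x}\mathbf{C}' + \mathbf{C}\frac{\partial \mathbf{C}'}{\partial x} + \mathbf{C}\mathbf{D}^{-1}\mathbf{C}'\frac{\partial \mathbf{C}}{\partial x}\mathbf{D}\mathbf{C}'.$$

Using Theorem 2.11(i) and (ii), we have

$$\mathrm{tr}\left(\mathbf{A}^{-1}\frac{\partial \mathbf{A}}{\partial x}\right) = \mathrm{tr}\left(\mathbf{D}^{-1}\frac{\partial \mathbf{D}}{\partial x} + \mathbf{C}\frac{\partial \mathbf{C}'}{\partial x} + \mathbf{C}'\frac{\partial \mathbf{C}}{\partial x}\right).$$

Since $\mathbf{C}$ is orthogonal, $\mathbf{C}'\mathbf{C} = \mathbf{I}$, which implies that

$$\frac{\partial(\mathbf{C}'\mathbf{C})}{\partial x} = \mathbf{C}'\frac{\partial \mathbf{C}}{\partial x} + \frac{\partial \mathbf{C}'}{\partial x}\mathbf{C} = \mathbf{O}$$

and

$$\mathrm{tr}\left(\mathbf{C}\frac{\partial \mathbf{C}'}{\partial x} + \mathbf{C}'\frac{\partial \mathbf{C}}{\partial x}\right) = \mathrm{tr}\left(\frac{\partial \mathbf{C}'}{\partial x}\mathbf{C} + \mathbf{C}'\frac{\partial \mathbf{C}}{\partial x}\right) = 0.$$

Thus $\mathrm{tr}[\mathbf{A}^{-1}(\partial\mathbf{A}/\partial x)] = \mathrm{tr}[\mathbf{D}^{-1}(\partial\mathbf{D}/\partial x)]$ and the result follows.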
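
The identity in (2.118) can be illustrated numerically in the same spirit. The sketch below uses a hypothetical positive definite matrix function $\mathbf{A}(x)$ with determinant $2 + x^2$, so that the exact derivative of $\log|\mathbf{A}|$ is $2x/(2 + x^2)$; the function names and the evaluation point are assumptions for illustration.

```python
import math

def A_of(x):
    """Hypothetical positive definite matrix function A(x) = [[1 + x^2, x], [x, 2]]."""
    return [[1 + x * x, x], [x, 2.0]]

def log_det(x):
    """log |A(x)| for the 2 x 2 matrix above."""
    (a, b), (c, d) = A_of(x)
    return math.log(a * d - b * c)

x, h = 0.7, 1e-6
(a, b), (c, d) = A_of(x)
det = a * d - b * c
Ainv = [[d / det, -b / det], [-c / det, a / det]]
dA = [[2 * x, 1.0], [1.0, 0.0]]          # elementwise derivative of A(x)

# tr(A^{-1} dA/dx) = sum over i, j of Ainv[i][j] * dA[j][i]
analytic = sum(Ainv[i][j] * dA[j][i] for i in range(2) for j in range(2))
numeric = (log_det(x + h) - log_det(x - h)) / (2 * h)
print(analytic, numeric)
```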




2.14.3. Maximization or Minimization of a Function of a Vector 


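Consider a function $u = f(\mathbf{x})$ of the $p$ variables in $\mathbf{x}$. In many cases we can find a
maximum or minimum of $u$ by solving the system of $p$ equations

$$\frac{\partial u}{\partial \mathbf{x}} = \mathbf{0}. \tag{2.119}$$

Occasionally the situation requires the maximization or minimization of the function
$u$, subject to $q$ constraints on $\mathbf{x}$. We denote the constraints as
$h_1(\mathbf{x}) = 0, h_2(\mathbf{x}) = 0, \ldots, h_q(\mathbf{x}) = 0$ or, more succinctly, $\mathbf{h}(\mathbf{x}) = \mathbf{0}$. Maximization
or minimization of $u$ subject to $\mathbf{h}(\mathbf{x}) = \mathbf{0}$ can often be carried out by the method of
Lagrange multipliers. We denote a vector of $q$ unknown constants (the Lagrange multipliers)
by $\boldsymbol{\lambda}$ and let $\mathbf{y}' = (\mathbf{x}', \boldsymbol{\lambda}')$. We then let $v = u + \boldsymbol{\lambda}'\mathbf{h}(\mathbf{x})$. The maximum or
minimum of $u$ subject to $\mathbf{h}(\mathbf{x}) = \mathbf{0}$ is obtained by solving the equations

$$\frac{\partial v}{\partial \mathbf{y}} = \mathbf{0}$$

or, equivalently

$$\frac{\partial u}{\partial \mathbf{x}} + \frac{\partial \mathbf{h}}{\partial \mathbf{x}}\boldsymbol{\lambda} = \mathbf{0} \quad\text{and}\quad \mathbf{h}(\mathbf{x}) = \mathbf{0}, \tag{2.120}$$

where

$$\frac{\partial \mathbf{h}}{\partial \mathbf{x}} = \begin{pmatrix} \dfrac{\partial h_1}{\partial x_1} & \cdots & \dfrac{\partial h_q}{\partial x_1} \\ \vdots & & \vdots \\ \dfrac{\partial h_1}{\partial x_p} & \cdots & \dfrac{\partial h_q}{\partial x_p} \end{pmatrix}.$$
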
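
To make (2.120) concrete, here is a small worked example. The objective $u = x_1^2 + x_2^2$ and the single constraint $x_1 + x_2 = 1$ are chosen purely for illustration and do not appear in the text.

```latex
% Minimize u = x_1^2 + x_2^2 subject to h(x) = x_1 + x_2 - 1 = 0  (p = 2, q = 1).
\begin{align*}
v &= x_1^2 + x_2^2 + \lambda (x_1 + x_2 - 1), \\
\frac{\partial u}{\partial \mathbf{x}} + \frac{\partial \mathbf{h}}{\partial \mathbf{x}} \lambda
  &= \begin{pmatrix} 2x_1 \\ 2x_2 \end{pmatrix} + \begin{pmatrix} 1 \\ 1 \end{pmatrix} \lambda
   = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \qquad x_1 + x_2 - 1 = 0, \\
x_1 &= x_2 = -\tfrac{\lambda}{2}
  \;\Longrightarrow\; -\lambda = 1
  \;\Longrightarrow\; \lambda = -1, \quad x_1 = x_2 = \tfrac{1}{2}, \quad u = \tfrac{1}{2}.
\end{align*}
```

Because $u$ is a convex function and the constraint is linear, the stationary point found here is the constrained minimum.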
PROBLEMS 


2.1 Prove Theorem 2.2a. 


7 —3 2 
2.2 Let a=(j 9 


(a) Find A’. 
(b) Verify that (A’)’ = A, thus illustrating Theorem 2.1. 
(c) Find A’A and AA’. 


2 4 1 3 
2.3 Leta=(_j }) ma B= (3 a} 


(a) Find AB and BA.
(b) Find |A|, |B|, and |AB|, and verify that Theorem 2.9c holds in this case.
(c) Find |BA| and compare to |AB|.
(d) Find (AB)' and compare to B'A'.
(e) Find tr(AB) and compare to tr(BA).
(f) Find the eigenvalues of AB and of BA, thus illustrating Theorem 2.12a.


1 3 -—4 3-2 5 
2.4 Let A=(§ 7 :) and B= (7 9 a 


(a) Find A + B and A − B.
(b) Find A' and B'.
(c) Find (A + B)' and A' + B', thus illustrating Theorem 2.2a(ii).


2.5 Verify the distributive law in (2.15), A(B + C) = AB + AC.


—2 5 he2 
2.6 Let A= (_5 : 4); B= 3 7), C=]-3 1 
6 —4 2 4 

(a) Find AB and BA.
(b) Find B + C, AC, and A(B + C). Compare A(B + C) with AB + AC, thus
illustrating (2.15).
(c) Compare (AB)' with B'A', thus illustrating Theorem 2.2b.
(d) Compare tr(AB) with tr(BA) and confirm that (2.87) holds in this case.
(e) Let $\mathbf{a}_1'$ and $\mathbf{a}_2'$ be the two rows of A. Find $\begin{pmatrix} \mathbf{a}_1'\mathbf{B} \\ \mathbf{a}_2'\mathbf{B} \end{pmatrix}$ and compare with AB in
part (a), thus illustrating (2.27).
(f) Let $\mathbf{b}_1$ and $\mathbf{b}_2$ be the two columns of B. Find $(\mathbf{A}\mathbf{b}_1, \mathbf{A}\mathbf{b}_2)$ and compare with
AB in part (a), thus illustrating (2.28).


2.7 Let
$$\mathbf{A} = \begin{pmatrix} 3 & 2 & 1 \\ 6 & 4 & 2 \\ 12 & 8 & 4 \end{pmatrix}, \quad \mathbf{B} = \begin{pmatrix} 1 & -1 & 2 \\ -1 & 1 & -2 \\ -1 & 1 & -2 \end{pmatrix}.$$

(a) Show that AB = O.
(b) Find a vector x such that Ax = 0.
(c) What is the rank of A and the rank of B?


2.8 If j is a vector of 1s, as defined in (2.6), show that

(a) $\mathbf{j}'\mathbf{a} = \mathbf{a}'\mathbf{j} = \sum_i a_i$, as in (2.24).
(b) Aj is a column vector whose elements are the row sums of A, as in (2.25).
(c) j'A is a row vector whose elements are the column sums of A, as in (2.25).




2.9 Prove Corollary 1 to Theorem 2.2b; that is, assuming that A, B, and C are
conformal, show that (ABC)' = C'B'A'.
2.10 Prove Theorem 2.2c. 
2.11 Use matrix A in Problem 2.6 and let 


5 0 0 
b= (5 2) D=|0 3 0 
0 0 6 


Find $\mathbf{D}_1\mathbf{A}$ and $\mathbf{A}\mathbf{D}_2$, thus illustrating (2.29) and (2.30).


1 
2.12 LetA= | 4 
7 


omnNNy 
coos 


3 0 0 
6], D= b 0 
9 Oc 


Find DA, AD, and DAD. 


2.13 For $\mathbf{y}' = (y_1, y_2, y_3)$ and the symmetric matrix


express y’Ay in the form given in (2.33). 


5 -1l 3 6 -—2 3 2 -3 
2.144 Let A=|{ —-1 12), B=|{7 1 Oj}, C=] -I 4}, 
3 2 2. —3 5 3 1 
3 3 7 
x=[-l], y={2 1], wa) 
4 


Find the following: 


(a) Bx
(b) y'B
(c) x'Ax
(d) x'Cz
(e) x'x
(f) x'y
(g) xx'
(h) xy'
(i) BB
(j) yz'
(k) zy'
(l) $\sqrt{\mathbf{y}'\mathbf{y}}$
(m) C'C


2.15 Use x, y, A, and B as defined in Problem 2.14. 


(a) Find x+y and x—y. 




(b) Find tr(A), tr(B), A + B, and tr(A + B). 
(c) Find AB and BA. 

(d) Find tr(AB) and tr(BA). 

(e) Find |AB| and |BA\. 

(f) Find (AB)' and B'A'.


2.16 Using B and x in Problem 2.14, find Bx as a linear combination of the columns
of B, as in (2.37), and compare with Bx as found in Problem 2.14(a).


2,3 1 -6 2 1 0 
uta={7 3). B=(5 9s) I= (9 2): 


(a) Show that (AB)' = B'A' as in (2.26).
(b) Show that AI = A and that IB = B.
(c) Find |A|.
(d) Find $\mathbf{A}^{-1}$.
(e) Find $(\mathbf{A}^{-1})^{-1}$ and compare with A, thus verifying (2.46).
(f) Find $(\mathbf{A}')^{-1}$ and verify that it is equal to $(\mathbf{A}^{-1})'$ as in Theorem 2.5a.


2.18 Let A and B be defined and partitioned as follows:


2.7 |! 2: 1 1 0) 
A=1|{3 2] oO], B=|]2 1 Il 2 
1 Ol 1 23 WW 2 


(a) Find AB as in (2.35), using the indicated partitioning. 
(b) Check by finding AB in the usual way, ignoring the partitioning. 


2.19 Partition the matrices A and B in Problem 2.18 as follows: 


De) te 
A=[3) 2 0] =(1,A2), 

fi. 0 “4 

Tes Hae a 8) P 
Beaji2 f- 4-9 = ): 

23.1 3 B, 


Repeat parts (a) and (b) of Problem 2.18. Note that in this case, (2.35) becomes
$\mathbf{A}\mathbf{B} = \mathbf{a}_1\mathbf{b}_1' + \mathbf{A}_2\mathbf{B}_2$.


2 
5 —2 3 
2.20 Let A= (5 3 # b= ES 

Find Ab as a linear combination of the columns of A as in (2.37) and check the
result by finding Ab in the usual way.


2.21 Show that each column of the product AB can be expressed as a linear
combination of the columns of A, with coefficients arising from the corresponding
column of B, as noted following Example 2.3.


2.22 Let
$$\mathbf{A} = \begin{pmatrix} 3 & 0 & 2 \\ 1 & -1 & 1 \\ 2 & 1 & 0 \end{pmatrix}, \quad \mathbf{B} = \begin{pmatrix} -2 & -1 \\ 3 & 1 \\ 1 & -1 \end{pmatrix}.$$

Express the columns of AB as linear combinations of the columns of A.


2.23 Show that if a set of vectors includes 0, the set is linearly dependent, as noted
following (2.40).


2.24 Suppose that A and B are $n \times n$ and that AB = O as in (2.43). Show that A
and B are both singular or one of them is O.


{2 
[ls 2 >. af 
2.25 Let A= (5 ; a B= : ! ; C= (5 oe 4): 

Find AB and CB. Are they equal? What are the ranks of A, B, and C?


21 
2.26 Let A= (4 ; ah B=[0 2 
1 0 

(a) Find a matrix C such that AB = CB. Is C unique?
(b) Find a vector x such that Ax = 0. Can you do this for B?


a ih -B 5 
2.27 Let A=(4 2 3], x=]|2 
1 20° 1 3 

(a) Find a matrix B ≠ A such that Ax = Bx. Why is this possible? Can A
and B be nonsingular? Can A − B be nonsingular?
(b) Find a matrix C ≠ O such that Cx = 0. Can C be nonsingular?


2.28 Prove Theorem 2.5a.

2.29 Prove Theorem 2.5b.


2.30 Use the matrix A in Problem 2.17, and let B = & Wiss 
3 1 
). Find AB, $\mathbf{B}^{-1}$, and $(\mathbf{A}\mathbf{B})^{-1}$. Verify that Theorem 2.5b holds in this case.


2.31 Show that the partitioned matrix $\mathbf{A} = \begin{pmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{A}_{22} \end{pmatrix}$ has the inverse indicated
in (2.50).

2.32 Show that the partitioned matrix $\mathbf{A} = \begin{pmatrix} \mathbf{A}_{11} & \mathbf{a}_{12} \\ \mathbf{a}_{12}' & a_{22} \end{pmatrix}$ has the inverse given in
(2.51).

2.33 Show that $\mathbf{B} + \mathbf{c}\mathbf{c}'$ has the inverse indicated in (2.53).

2.34 Show that A + PBQ has the inverse indicated in (2.54).

2.35 Show that $\mathbf{y}'\mathbf{A}\mathbf{y} = \mathbf{y}'[\tfrac{1}{2}(\mathbf{A} + \mathbf{A}')]\mathbf{y}$ as in (2.55).

2.36 Prove Theorem 2.6b(ii).

2.37 Prove Corollaries 1 and 2 of Theorem 2.6b.

2.38 Prove the "only if" part of Theorem 2.6c.

2.39 Prove Corollary 1 to Theorem 2.6c.

2.40 Compare the rank of the augmented matrix with the rank of the coefficient
matrix for each of the following systems of equations. Find solutions where
they exist.

(a) $x_1 + 2x_2 + 3x_3 = 6$
    XxX} —X%=2
    x, —-%=-1

(b) $x_1 - x_2 + 2x_3 = 2$
    $x_1 - x_2 - x_3 = -1$
    $2x_1 - 2x_2 + x_3 = 2$

(c) $x_1 + x_2 + x_3 + x_4 = 8$
    $x_1 - x_2 - x_3 - x_4 = 6$
    $3x_1 + x_2 + x_3 + x_4 = 22$

2.41 Prove Theorem 2.8a.

2.42 For the matrices $\mathbf{A}$, $\mathbf{A}_1^-$, and $\mathbf{A}_2^-$ in (2.59) and (2.60), show that $\mathbf{A}\mathbf{A}_1^-\mathbf{A} = \mathbf{A}$
and $\mathbf{A}\mathbf{A}_2^-\mathbf{A} = \mathbf{A}$.

2.43 Show that $\mathbf{A}_1^-$ in (2.60) can be obtained using Theorem 2.8b.

2.44 Show that $\mathbf{A}_2^-$ in (2.60) can be obtained using the five-step algorithm
following Theorem 2.8b.

2.45 Prove Theorem 2.8c.

2.46 Show that if A is symmetric, there exists a symmetric generalized inverse for
A, as noted following Theorem 2.8c.

2.47 Let
$$\mathbf{A} = \begin{pmatrix} 4 & 2 & 2 \\ 2 & 2 & 0 \\ 2 & 0 & 2 \end{pmatrix}.$$

(a) Find a symmetric generalized inverse for A.


(b) Find a nonsymmetric generalized inverse for A.

2.48 (a) Show that if A is nonsingular, then $\mathbf{A}^- = \mathbf{A}^{-1}$.
(b) Show that if A is $n \times p$ of rank $p < n$, then $\mathbf{A}^-$ is a "left inverse" of A,
that is, $\mathbf{A}^-\mathbf{A} = \mathbf{I}$.

2.49 Prove Theorem 2.9a parts (iv) and (vi).

2.50 Use the matrix A from Problem 2.17 to illustrate (2.64), (2.66), and (2.67) in
Theorem 2.9a.

2.51 (a) Multiply A in Problem 2.50 by 10 and verify that (2.69) holds in this case.
(b) Verify that (2.69) holds in general.

2.52 Prove Corollaries 1, 2, 3, and 4 of Theorem 2.9b.

2.53 Prove Corollaries 1 and 2 of Theorem 2.9c.

2.54 Use A in Problem 2.50 and let B = & “ 

(a) Find |A|, |B|, AB, and |AB| and illustrate (2.77).
(b) Find $|\mathbf{A}|^2$ and $|\mathbf{A}^2|$ and illustrate (2.79).

2.55 Use Theorem 2.9c and Corollary 1 of Theorem 2.9b to prove Theorem 2.9b.

2.56 Show that if C'C = I, then CC' = I as in (2.84).

2.57 The columns of the following matrix are mutually orthogonal:
-1 1 
A=|{-1l 0 2 
1 1 

(a) Normalize the columns of A by dividing each column by its length;
denote the resulting matrix by C.
(b) Show that C'C = CC' = I.

2.58 Prove Theorem 2.10a.

2.59 Prove Theorem 2.11 parts (i), (iv), (v), and (vii).

2.60 Use matrix B in Problem 2.26 to illustrate Theorem 2.11 parts (iii) and (iv).

2.61 Use matrix A in Problem 2.26 to illustrate Theorem 2.11(v), that is,
$\mathrm{tr}(\mathbf{A}'\mathbf{A}) = \mathrm{tr}(\mathbf{A}\mathbf{A}') = \sum_{ij} a_{ij}^2$.

2.62 Show that $\mathrm{tr}(\mathbf{A}^-\mathbf{A}) = \mathrm{tr}(\mathbf{A}\mathbf{A}^-) = r = \mathrm{rank}(\mathbf{A})$, as in (2.93).

2.63 Use $\mathbf{A}$ in (2.59) and $\mathbf{A}_1^-$ in (2.60) to illustrate Theorem 2.11(viii), that is,
$\mathrm{tr}(\mathbf{A}^-\mathbf{A}) = \mathrm{tr}(\mathbf{A}\mathbf{A}^-) = r = \mathrm{rank}(\mathbf{A})$.


2.64 Obtain $\mathbf{x}_2 = (2/\sqrt{5}, 1/\sqrt{5})'$ in Example 2.12.1.

2.65 For $k = 3$, show that $\mathbf{A}^k\mathbf{x} = \lambda^k\mathbf{x}$ as in (2.100).

2.66 Show that $\lim_{k\to\infty} \mathbf{A}^k = \mathbf{O}$ in (2.102) if A is symmetric and if all eigenvalues
of A satisfy $-1 < \lambda < 1$.

2.67 Prove Theorem 2.12a.

2.68 Prove Theorem 2.12b.

2.69 Prove Theorem 2.12c(ii) for the case where the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_n$ are
distinct.

2.70 Prove Corollary 1 to Theorem 2.12d.

2.71 Let
$$\mathbf{A} = \begin{pmatrix} 3 & 1 & 1 \\ 1 & 0 & 2 \\ 1 & 2 & 0 \end{pmatrix}.$$

(a) The eigenvalues of A are 1, 4, −2. Find the normalized eigenvectors and
use them as columns in an orthogonal matrix C.
(b) Show that A = CDC', as in (2.103), where D = diag(1, 4, −2).
(c) Show that C'AC = D as in (2.106).

2.72 Prove Theorem 2.12e for a symmetric matrix A.

2.73 Let
$$\mathbf{A} = \begin{pmatrix} 1 & 1 & -2 \\ -1 & 2 & 1 \\ 0 & 1 & -1 \end{pmatrix}.$$

(a) Find the eigenvalues and associated normalized eigenvectors.
(b) Find tr(A) and |A| and verify that $\mathrm{tr}(\mathbf{A}) = \sum_{i=1}^3 \lambda_i$ and $|\mathbf{A}| = \prod_{i=1}^3 \lambda_i$, as
in Theorem 2.12e.

2.74 Prove Theorem 2.12f(ii).

2.75 Let
$$\mathbf{A} = \begin{pmatrix} 1 & 0 & -1 \\ 0 & 1 & -1 \\ -1 & -1 & 3 \end{pmatrix}.$$

(a) Show that |A| > 0.
(b) Find the eigenvalues of A. Are they all positive?

2.76 Let $\mathbf{A}^{1/2}$ be defined as in (2.109).


(a) Show that $\mathbf{A}^{1/2}$ is symmetric.
(b) Show that $(\mathbf{A}^{1/2})^2 = \mathbf{A}$ as in (2.110).

2.77 For the positive definite matrix A = ( ae ) , calculate the eigenvalues
—1 2 
and eigenvectors and find the square root matrix $\mathbf{A}^{1/2}$ as in (2.109). Check
by showing $(\mathbf{A}^{1/2})^2 = \mathbf{A}$.

2.78 Prove Theorem 2.13e.

2.79 Prove Theorem 2.13f.

2.80 Let A = 
oro 
2 
3 
0 
v2 
3 
iS) 
wile ous 

(a) Find the rank of A.
(b) Show that A is idempotent.
(c) Show that I − A is idempotent.
(d) Show that A(I − A) = O.
(e) Find tr(A).
(f) Find the eigenvalues of A.

2.81 Consider a $p \times p$ matrix A with eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_p$. Show that
$[\mathrm{tr}(\mathbf{A})]^2 = \mathrm{tr}(\mathbf{A}^2) + 2\sum_{i<j} \lambda_i\lambda_j$.

2.82 Consider a nonsingular $n \times n$ matrix A whose elements are functions of the
scalar $x$. Also consider the full-rank $p \times n$ matrix B. Let $\mathbf{H} = \mathbf{B}'(\mathbf{B}\mathbf{A}\mathbf{B}')^{-1}\mathbf{B}$.
Show that

$$\frac{\partial \mathbf{H}}{\partial x} = -\mathbf{H}\frac{\partial \mathbf{A}}{\partial x}\mathbf{H}.$$

2.83 Show that

$$\frac{\partial \ln|\mathbf{X}|}{\partial \mathbf{X}} = 2\mathbf{X}^{-1} - \mathrm{diag}(\mathbf{X}^{-1})$$

for a $2 \times 2$ positive definite matrix X.

2.84 Let $u = \mathbf{x}'\mathbf{A}\mathbf{x}$ where x is a $3 \times 1$ vector and
$$\mathbf{A} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 3 \end{pmatrix}.$$
Use the Lagrange multiplier method to find the vector x that minimizes $u$ subject to
the constraints $x_1 + x_2 = 2$ and $x_2 + x_3 = 3$.


3 Random Vectors and Matrices 


3.1 INTRODUCTION


As we work with linear models, it is often convenient to express the observed data 
(or data that will be observed) in the form of a vector or matrix. A random vector 
or random matrix is a vector or matrix whose elements are random variables. 
Informally, a random variable is defined as a variable whose value depends on the 
outcome of a chance experiment. (Formally, a random variable is a function 
defined for each element of a sample space.) 

In terms of experimental structure, we can distinguish two kinds of random vectors: 


1. A vector containing a measurement on each of n different individuals or experimental
units. In this case, where the same variable is observed on each of n
units selected at random, the n random variables $y_1, y_2, \ldots, y_n$ in the vector are
typically uncorrelated and have the same variance.


2. A vector consisting of p different measurements on one individual or exper- 
imental unit. The p random variables thus obtained are typically correlated 
and have different variances. 

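To illustrate the first type of random vector, consider the multiple regression model

$$y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \cdots + \beta_kx_{ik} + \varepsilon_i, \quad i = 1, 2, \ldots, n,$$

as given in (1.2). In Chapters 7–9, we treat the x variables as constants, in which case
we have two random vectors:

$$\mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} \quad\text{and}\quad \boldsymbol{\varepsilon} = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}. \tag{3.1}$$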



The $y_i$ values are observable, but the $\varepsilon_i$'s are not observable unless the $\beta$'s are
known.

To illustrate the second type of random vector, consider regression of y on several
random x variables (this regression case is discussed in Chapter 10). For the ith individual
in the sample, we observe the $k + 1$ random variables $y_i, x_{i1}, x_{i2}, \ldots, x_{ik}$, which
constitute the random vector $(y_i, x_{i1}, \ldots, x_{ik})'$. In some cases, the $k + 1$ variables
$y_i, x_{i1}, \ldots, x_{ik}$ are all measured using the same units or scale of measurement, but
typically the scales differ.


3.2 MEANS, VARIANCES, COVARIANCES, AND CORRELATIONS 


In this section, we review some properties of univariate and bivariate random vari- 
ables. We begin with a univariate random variable y. We do not distinguish notation- 
ally between the random variable y and an observed value of y. In many texts, an 
uppercase letter is used for the random variable and the corresponding lowercase 
letter represents a realization of the random variable, as in the expression P(Y < y). 
This practice is convenient in a univariate context but would be confusing in 
the present text where we use uppercase letters for matrices and lowercase letters 
for vectors. 

If $f(y)$ is the density of the random variable y, the mean or expected value of y is
defined as

$$\mu = E(y) = \int_{-\infty}^{\infty} y f(y)\,dy. \tag{3.2}$$


This is the population mean. Later (beginning in Chapter 5), we also use the sample 
mean of y, obtained from a random sample of n observed values of y. 

The expected value of a function of y such as $y^2$ can be found directly without first
finding the density of $y^2$. In general, for a function $u(y)$, we have

$$E[u(y)] = \int_{-\infty}^{\infty} u(y) f(y)\,dy. \tag{3.3}$$

For a constant $a$ and functions $u(y)$ and $v(y)$, it follows from (3.3) that

$$E(ay) = aE(y), \tag{3.4}$$

$$E[u(y) + v(y)] = E[u(y)] + E[v(y)]. \tag{3.5}$$


The variance of a random variable y is defined as

$$\sigma^2 = \mathrm{var}(y) = E(y - \mu)^2. \tag{3.6}$$




This is the population variance. Later (beginning in Chapter 5), we also use the 
sample variance of y, obtained from a random sample of n observed values of y. 
The square root of the variance is known as the standard deviation: 


$$\sigma = \sqrt{\mathrm{var}(y)} = \sqrt{E(y - \mu)^2}. \tag{3.7}$$

Using (3.4) and (3.5), we can express the variance of y in the form

$$\sigma^2 = \mathrm{var}(y) = E(y^2) - \mu^2. \tag{3.8}$$

If $a$ is a constant, we can use (3.4) and (3.6) to show that

$$\mathrm{var}(ay) = a^2\mathrm{var}(y) = a^2\sigma^2. \tag{3.9}$$


For any two variables $y_i$ and $y_j$ in the random vector y in (3.1), we define the
covariance as

$$\sigma_{ij} = \mathrm{cov}(y_i, y_j) = E[(y_i - \mu_i)(y_j - \mu_j)], \tag{3.10}$$

where $\mu_i = E(y_i)$ and $\mu_j = E(y_j)$. Using (3.4) and (3.5), we can express $\sigma_{ij}$ in the form

$$\sigma_{ij} = \mathrm{cov}(y_i, y_j) = E(y_iy_j) - \mu_i\mu_j. \tag{3.11}$$

Two random variables $y_i$ and $y_j$ are said to be independent if their joint density
factors into the product of their marginal densities

$$f(y_i, y_j) = f_i(y_i)f_j(y_j), \tag{3.12}$$

where the marginal density $f_i(y_i)$ is defined as

$$f_i(y_i) = \int_{-\infty}^{\infty} f(y_i, y_j)\,dy_j. \tag{3.13}$$

From the definition of independence in (3.12), we obtain the following properties:

1. $E(y_iy_j) = E(y_i)E(y_j)$ if $y_i$ and $y_j$ are independent. (3.14)
2. $\sigma_{ij} = \mathrm{cov}(y_i, y_j) = 0$ if $y_i$ and $y_j$ are independent. (3.15)


The second property follows from the first. 


Figure 3.1 Region for f(x, y) in Example 3.2, lying between the curves $y = 2x - x^2$ and $y = 1 + 2x - x^2$.

In the first type of random vector defined in Section 3.1, the variables $y_1, y_2, \ldots, y_n$
would typically be independent if obtained from a random sample, and we would
thus have $\sigma_{ij} = 0$ for all $i \neq j$. However, for the variables in the second type of
random vector, we would typically have $\sigma_{ij} \neq 0$ for at least some values of i and j.

The converse of the property in (3.15) is not true; that is, $\sigma_{ij} = 0$ does not imply
independence. This is illustrated in the following example.


Example 3.2. Suppose that the bivariate random variable (x, y) is distributed uniformly
over the region $0 \le x \le 2$, $2x - x^2 \le y \le 1 + 2x - x^2$; see Figure 3.1.
The area of the region is given by

$$\text{Area} = \int_0^2 \int_{2x - x^2}^{1 + 2x - x^2} dy\,dx = 2.$$

Hence, for a uniform distribution over the region, we set

$$f(x, y) = \tfrac{1}{2}, \quad 0 \le x \le 2, \quad 2x - x^2 \le y \le 1 + 2x - x^2,$$

so that $\iint f(x, y)\,dx\,dy = 1$.
To find $\sigma_{xy}$ using (3.11), we need $E(xy)$, $E(x)$, and $E(y)$. The first of these is
given by

$$E(xy) = \int_0^2 \int_{2x - x^2}^{1 + 2x - x^2} xy\left(\tfrac{1}{2}\right) dy\,dx = \int_0^2 \frac{x}{4}(1 + 4x - 2x^2)\,dx = \frac{7}{6}.$$



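To find $E(x)$ and $E(y)$, we first find the marginal distributions of x and y. For $f_1(x)$,
we have, by (3.13),

$$f_1(x) = \int_{2x - x^2}^{1 + 2x - x^2} \tfrac{1}{2}\,dy = \tfrac{1}{2}, \quad 0 \le x \le 2.$$

For $f_2(y)$, we obtain different results for $0 \le y \le 1$ and $1 \le y \le 2$:

$$f_2(y) = \int_0^{1 - \sqrt{1 - y}} \tfrac{1}{2}\,dx + \int_{1 + \sqrt{1 - y}}^{2} \tfrac{1}{2}\,dx = 1 - \sqrt{1 - y}, \quad 0 \le y \le 1, \tag{3.16}$$

$$f_2(y) = \int_{1 - \sqrt{2 - y}}^{1 + \sqrt{2 - y}} \tfrac{1}{2}\,dx = \sqrt{2 - y}, \quad 1 \le y \le 2. \tag{3.17}$$

Then

$$E(x) = \int_0^2 x\left(\tfrac{1}{2}\right) dx = 1,$$

$$E(y) = \int_0^1 y(1 - \sqrt{1 - y})\,dy + \int_1^2 y\sqrt{2 - y}\,dy = \frac{7}{6}.$$

Now by (3.11), we obtain

$$\sigma_{xy} = E(xy) - E(x)E(y) = \tfrac{7}{6} - (1)\left(\tfrac{7}{6}\right) = 0.$$

However, x and y are clearly dependent since the range of y for each x depends on the
value of x.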
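
The moments in Example 3.2 can be approximated numerically as a rough cross-check. The sketch below applies a midpoint rule over the region; the function name `expect` and the grid sizes `n` and `m` are arbitrary illustrative choices, and the results are approximations rather than exact values.

```python
def expect(g, n=400, m=50):
    """Midpoint-rule approximation of E[g(x, y)] for the uniform density
    f(x, y) = 1/2 on 0 <= x <= 2, 2x - x^2 <= y <= 1 + 2x - x^2."""
    total = 0.0
    hx = 2.0 / n
    hy = 1.0 / m          # the region has height 1 in y for every x
    for i in range(n):
        x = (i + 0.5) * hx
        lower = 2 * x - x * x
        for j in range(m):
            y = lower + (j + 0.5) * hy
            total += g(x, y) * 0.5 * hx * hy
    return total

E_x = expect(lambda x, y: x)
E_y = expect(lambda x, y: y)
E_xy = expect(lambda x, y: x * y)
cov_xy = E_xy - E_x * E_y      # should be close to 0
print(E_x, E_y, E_xy, cov_xy)
```

The approximate covariance is near zero even though the support of y clearly shifts with x, which is the point of the example.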

As a further indication of the dependence of y on x, we examine $E(y|x)$, the
expected value of y for a given value of x, which is found as

$$E(y|x) = \int y f(y|x)\,dy.$$

The conditional density $f(y|x)$ is defined as

$$f(y|x) = \frac{f(x, y)}{f_1(x)}, \tag{3.18}$$

which becomes

$$f(y|x) = \frac{1/2}{1/2} = 1, \quad 2x - x^2 \le y \le 1 + 2x - x^2.$$

Thus

$$E(y|x) = \int_{2x - x^2}^{1 + 2x - x^2} y(1)\,dy = \tfrac{1}{2}(1 + 4x - 2x^2).$$

Since $E(y|x)$ depends on x, the two variables are dependent. Note that
$E(y|x) = \tfrac{1}{2}(1 + 4x - 2x^2)$ is the average of the two curves $y = 2x - x^2$ and
$y = 1 + 2x - x^2$. This is illustrated in Figure 3.2.

Figure 3.2 E(y|x) in Example 3.2, showing the curve $E(y|x) = \tfrac{1}{2}(1 + 4x - 2x^2)$ between $y = 2x - x^2$ and $y = 1 + 2x - x^2$.


In Example 3.2 we have two dependent random variables x and y for which $\sigma_{xy} = 0$.
In cases such as this, $\sigma_{xy}$ is not a good measure of relationship. However, if x and y have
a bivariate normal distribution (see Section 4.2), then $\sigma_{xy} = 0$ implies independence of
x and y (see Corollary 1 to Theorem 4.4c). In the bivariate normal case, $E(y|x)$ is a
linear function of x (see Theorem 4.4d), and curves such as
$E(y|x) = \tfrac{1}{2}(1 + 4x - 2x^2)$ do not occur.

The covariance $\sigma_{ij}$ as defined in (3.10) depends on the scale of measurement of
both $y_i$ and $y_j$. To standardize $\sigma_{ij}$, we divide it by (the product of) the standard
deviations of $y_i$ and $y_j$ to obtain the correlation:

$$\rho_{ij} = \mathrm{corr}(y_i, y_j) = \frac{\sigma_{ij}}{\sigma_i\sigma_j}. \tag{3.19}$$




3.3. MEAN VECTORS AND COVARIANCE MATRICES FOR 
RANDOM VECTORS 


3.3.1 Mean Vectors 


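The expected value of a $p \times 1$ random vector y is defined as the vector of expected
values of the p random variables $y_1, y_2, \ldots, y_p$ in y:

$$E(\mathbf{y}) = E\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_p \end{pmatrix} = \begin{pmatrix} E(y_1) \\ E(y_2) \\ \vdots \\ E(y_p) \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_p \end{pmatrix} = \boldsymbol{\mu}, \tag{3.20}$$

where $E(y_i) = \mu_i$ is obtained as $E(y_i) = \int y_if_i(y_i)\,dy_i$ using $f_i(y_i)$, the marginal
density of $y_i$.

If x and y are $p \times 1$ random vectors, it follows from (3.20) and (3.5) that the
expected value of their sum is the sum of their expected values:

$$E(\mathbf{x} + \mathbf{y}) = E(\mathbf{x}) + E(\mathbf{y}). \tag{3.21}$$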


3.3.2 Covariance Matrix 


The variances $\sigma_1^2, \sigma_2^2, \ldots, \sigma_p^2$ of $y_1, y_2, \ldots, y_p$ and the covariances $\sigma_{ij}$ for all $i \neq j$
can be conveniently displayed in the covariance matrix, which is denoted by $\boldsymbol{\Sigma}$,
the uppercase version of $\sigma_{ij}$:

$$\boldsymbol{\Sigma} = \mathrm{cov}(\mathbf{y}) = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2p} \\ \vdots & \vdots & & \vdots \\ \sigma_{p1} & \sigma_{p2} & \cdots & \sigma_{pp} \end{pmatrix}. \tag{3.22}$$

The ith row of $\boldsymbol{\Sigma}$ contains the variance of $y_i$ and the covariance of $y_i$ with each of the
other y variables. To be consistent with the notation $\sigma_{ij}$, we have used $\sigma_{ii} = \sigma_i^2$, $i = 1,
2, \ldots, p$, for the variances. The variances are on the diagonal of $\boldsymbol{\Sigma}$, and the covariances
occupy off-diagonal positions. There is a distinction in the font used for $\boldsymbol{\Sigma}$
as the covariance matrix and $\sum$ as the summation symbol. Note also the distinction
in meaning between the notation $\mathrm{cov}(\mathbf{y}) = \boldsymbol{\Sigma}$ and $\mathrm{cov}(y_i, y_j) = \sigma_{ij}$.

The covariance matrix $\boldsymbol{\Sigma}$ is symmetric because $\sigma_{ij} = \sigma_{ji}$ [see (3.10)]. In many applications,
$\boldsymbol{\Sigma}$ is assumed to be positive definite. This will ordinarily hold if the y variables
are continuous random variables and if there are no linear relationships among them.
(If there are linear relationships among the y variables, $\boldsymbol{\Sigma}$ will be positive semidefinite.)




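By analogy with (3.20), we define the expected value of a random matrix Z as the
matrix of expected values:

$$E(\mathbf{Z}) = E\begin{pmatrix} z_{11} & z_{12} & \cdots & z_{1p} \\ z_{21} & z_{22} & \cdots & z_{2p} \\ \vdots & \vdots & & \vdots \\ z_{n1} & z_{n2} & \cdots & z_{np} \end{pmatrix} = \begin{pmatrix} E(z_{11}) & E(z_{12}) & \cdots & E(z_{1p}) \\ E(z_{21}) & E(z_{22}) & \cdots & E(z_{2p}) \\ \vdots & \vdots & & \vdots \\ E(z_{n1}) & E(z_{n2}) & \cdots & E(z_{np}) \end{pmatrix}. \tag{3.23}$$

We can express $\boldsymbol{\Sigma}$ as the expected value of a random matrix. By (2.21), the (ij)th
element of the matrix $(\mathbf{y} - \boldsymbol{\mu})(\mathbf{y} - \boldsymbol{\mu})'$ is $(y_i - \mu_i)(y_j - \mu_j)$. Thus, by (3.10) and (3.23),
the (ij)th element of $E[(\mathbf{y} - \boldsymbol{\mu})(\mathbf{y} - \boldsymbol{\mu})']$ is $E[(y_i - \mu_i)(y_j - \mu_j)] = \sigma_{ij}$. Hence

$$E[(\mathbf{y} - \boldsymbol{\mu})(\mathbf{y} - \boldsymbol{\mu})'] = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2p} \\ \vdots & \vdots & & \vdots \\ \sigma_{p1} & \sigma_{p2} & \cdots & \sigma_{pp} \end{pmatrix} = \boldsymbol{\Sigma}. \tag{3.24}$$

We illustrate (3.24) for $p = 3$:

$$\boldsymbol{\Sigma} = E[(\mathbf{y} - \boldsymbol{\mu})(\mathbf{y} - \boldsymbol{\mu})'] = E\left[\begin{pmatrix} y_1 - \mu_1 \\ y_2 - \mu_2 \\ y_3 - \mu_3 \end{pmatrix}(y_1 - \mu_1, y_2 - \mu_2, y_3 - \mu_3)\right]$$

$$= E\begin{pmatrix} (y_1 - \mu_1)^2 & (y_1 - \mu_1)(y_2 - \mu_2) & (y_1 - \mu_1)(y_3 - \mu_3) \\ (y_2 - \mu_2)(y_1 - \mu_1) & (y_2 - \mu_2)^2 & (y_2 - \mu_2)(y_3 - \mu_3) \\ (y_3 - \mu_3)(y_1 - \mu_1) & (y_3 - \mu_3)(y_2 - \mu_2) & (y_3 - \mu_3)^2 \end{pmatrix}$$

$$= \begin{pmatrix} E(y_1 - \mu_1)^2 & E[(y_1 - \mu_1)(y_2 - \mu_2)] & E[(y_1 - \mu_1)(y_3 - \mu_3)] \\ E[(y_2 - \mu_2)(y_1 - \mu_1)] & E(y_2 - \mu_2)^2 & E[(y_2 - \mu_2)(y_3 - \mu_3)] \\ E[(y_3 - \mu_3)(y_1 - \mu_1)] & E[(y_3 - \mu_3)(y_2 - \mu_2)] & E(y_3 - \mu_3)^2 \end{pmatrix}$$

$$= \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \sigma_{13} \\ \sigma_{21} & \sigma_2^2 & \sigma_{23} \\ \sigma_{31} & \sigma_{32} & \sigma_3^2 \end{pmatrix}.$$

We can write (3.24) in the form

$$\boldsymbol{\Sigma} = E[(\mathbf{y} - \boldsymbol{\mu})(\mathbf{y} - \boldsymbol{\mu})'] = E(\mathbf{y}\mathbf{y}') - \boldsymbol{\mu}\boldsymbol{\mu}', \tag{3.25}$$

which is analogous to (3.8) and (3.11).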




3.3.3. Generalized Variance 


A measure of overall variability in the population of y variables can be defined as the
determinant of $\boldsymbol{\Sigma}$:

$$\text{Generalized variance} = |\boldsymbol{\Sigma}|. \tag{3.26}$$

If $|\boldsymbol{\Sigma}|$ is small, the y variables are concentrated closer to $\boldsymbol{\mu}$ than if $|\boldsymbol{\Sigma}|$ is large. A small
value of $|\boldsymbol{\Sigma}|$ may also indicate that the variables $y_1, y_2, \ldots, y_p$ in y are highly intercorrelated,
in which case the y variables tend to occupy a subspace of the p dimensions
[this corresponds to one or more small eigenvalues; see Rencher (1998, Section
2.1.3)].


3.3.4 Standardized Distance 


To obtain a useful measure of distance between y and $\boldsymbol{\mu}$, we need to account for the
variances and covariances of the $y_i$ variables in y. By analogy with the univariate standardized
variable $(y - \mu)/\sigma$, which has mean 0 and variance 1, the standardized distance
is defined as

$$\text{Standardized distance} = (\mathbf{y} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{y} - \boldsymbol{\mu}). \tag{3.27}$$

The use of $\boldsymbol{\Sigma}^{-1}$ standardizes the (transformed) $y_i$ variables so that they have means
equal to 0 and variances equal to 1 and are also uncorrelated (see Problem 3.11).
A distance such as (3.27) is often called a Mahalanobis distance (Mahalanobis 1936).
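
As a small illustration of (3.27), the sketch below evaluates the standardized distance for a bivariate case. The covariance matrix, mean vector, and observation are made-up integer values, and the helper name `standardized_distance` is hypothetical; exact fractions are used so the result can be read off without rounding.

```python
from fractions import Fraction

def standardized_distance(y, mu, Sigma):
    """(y - mu)' Sigma^{-1} (y - mu) for a 2 x 2 covariance matrix with integer entries."""
    (s11, s12), (s21, s22) = Sigma
    det = s11 * s22 - s12 * s21
    inv = [[Fraction(s22, det), Fraction(-s12, det)],
           [Fraction(-s21, det), Fraction(s11, det)]]
    d = [y[0] - mu[0], y[1] - mu[1]]
    return sum(d[i] * inv[i][j] * d[j] for i in range(2) for j in range(2))

Sigma = [[4, 2], [2, 9]]     # illustrative values, not from the text
mu = [1, 1]
y = [3, 4]
dist = standardized_distance(y, mu, Sigma)
print(dist)
```

Here $\mathbf{y} - \boldsymbol{\mu} = (2, 3)'$ and $|\boldsymbol{\Sigma}| = 32$, so the distance works out to $48/32 = 3/2$.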


3.4 CORRELATION MATRICES 


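By analogy with $\boldsymbol{\Sigma}$ in (3.22), the correlation matrix is defined as

$$\mathbf{P}_\rho = (\rho_{ij}) = \begin{pmatrix} 1 & \rho_{12} & \cdots & \rho_{1p} \\ \rho_{21} & 1 & \cdots & \rho_{2p} \\ \vdots & \vdots & & \vdots \\ \rho_{p1} & \rho_{p2} & \cdots & 1 \end{pmatrix}, \tag{3.28}$$

where $\rho_{ij} = \sigma_{ij}/\sigma_i\sigma_j$ is the correlation of $y_i$ and $y_j$ defined in (3.19). The second row
of $\mathbf{P}_\rho$, for example, contains the correlation of $y_2$ with each of the other y variables.
We use the subscript $\rho$ in $\mathbf{P}_\rho$ to emphasize that P is the uppercase version of $\rho$.

If we define

$$\mathbf{D}_\sigma = [\mathrm{diag}(\boldsymbol{\Sigma})]^{1/2} = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_p), \tag{3.29}$$

then by (2.31), we can obtain $\mathbf{P}_\rho$ from $\boldsymbol{\Sigma}$ and vice versa:

$$\mathbf{P}_\rho = \mathbf{D}_\sigma^{-1}\boldsymbol{\Sigma}\mathbf{D}_\sigma^{-1}, \tag{3.30}$$

$$\boldsymbol{\Sigma} = \mathbf{D}_\sigma\mathbf{P}_\rho\mathbf{D}_\sigma. \tag{3.31}$$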
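
The conversions in (3.30) and (3.31) amount to elementwise scaling by the standard deviations. The sketch below, with hypothetical function names `cov_to_corr` and `corr_to_cov` and an illustrative $2 \times 2$ covariance matrix, shows one way this might be coded.

```python
import math

def cov_to_corr(Sigma):
    """Return P = D^{-1} Sigma D^{-1}, where D = diag(sigma_1, ..., sigma_p)."""
    p = len(Sigma)
    sd = [math.sqrt(Sigma[i][i]) for i in range(p)]
    return [[Sigma[i][j] / (sd[i] * sd[j]) for j in range(p)] for i in range(p)]

def corr_to_cov(P, sd):
    """Return Sigma = D P D for a list of standard deviations sd."""
    p = len(P)
    return [[sd[i] * P[i][j] * sd[j] for j in range(p)] for i in range(p)]

Sigma = [[4.0, 2.0], [2.0, 9.0]]          # illustrative values, not from the text
P = cov_to_corr(Sigma)                     # off-diagonal element is 2 / (2 * 3)
Sigma_back = corr_to_cov(P, [2.0, 3.0])    # recovers Sigma up to rounding
print(P, Sigma_back)
```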


3.5 MEAN VECTORS AND COVARIANCE MATRICES FOR
PARTITIONED RANDOM VECTORS


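Suppose that the random vector v is partitioned into two subsets of variables, which
we denote by y and x:

$$\mathbf{v} = \begin{pmatrix} \mathbf{y} \\ \mathbf{x} \end{pmatrix} = (y_1, \ldots, y_p, x_1, \ldots, x_q)'.$$

Thus there are $p + q$ random variables in v.
The mean vector and covariance matrix for v partitioned as above can be expressed
in the following form:

$$\boldsymbol{\mu} = E(\mathbf{v}) = E\begin{pmatrix} \mathbf{y} \\ \mathbf{x} \end{pmatrix} = \begin{pmatrix} E(\mathbf{y}) \\ E(\mathbf{x}) \end{pmatrix} = \begin{pmatrix} \boldsymbol{\mu}_y \\ \boldsymbol{\mu}_x \end{pmatrix}, \tag{3.32}$$

$$\boldsymbol{\Sigma} = \mathrm{cov}(\mathbf{v}) = \mathrm{cov}\begin{pmatrix} \mathbf{y} \\ \mathbf{x} \end{pmatrix} = \begin{pmatrix} \boldsymbol{\Sigma}_{yy} & \boldsymbol{\Sigma}_{yx} \\ \boldsymbol{\Sigma}_{xy} & \boldsymbol{\Sigma}_{xx} \end{pmatrix}, \tag{3.33}$$

where $\boldsymbol{\Sigma}_{xy} = \boldsymbol{\Sigma}_{yx}'$. In (3.32), the submatrix $\boldsymbol{\mu}_y = [E(y_1), E(y_2), \ldots, E(y_p)]'$
contains the means of $y_1, y_2, \ldots, y_p$. Similarly $\boldsymbol{\mu}_x$ contains the means of the x
variables. In (3.33), the submatrix $\boldsymbol{\Sigma}_{yy} = \mathrm{cov}(\mathbf{y})$ is a $p \times p$ covariance matrix for y
containing the variances of $y_1, y_2, \ldots, y_p$ on the diagonal and the covariance of
each $y_i$ with each $y_j$ ($i \neq j$) off the diagonal:

$$\boldsymbol{\Sigma}_{yy} = \begin{pmatrix} \sigma_{y_1}^2 & \sigma_{y_1y_2} & \cdots & \sigma_{y_1y_p} \\ \sigma_{y_2y_1} & \sigma_{y_2}^2 & \cdots & \sigma_{y_2y_p} \\ \vdots & \vdots & & \vdots \\ \sigma_{y_py_1} & \sigma_{y_py_2} & \cdots & \sigma_{y_p}^2 \end{pmatrix}.$$

Similarly, $\boldsymbol{\Sigma}_{xx} = \mathrm{cov}(\mathbf{x})$ is the $q \times q$ covariance matrix of $x_1, x_2, \ldots, x_q$. The matrix
$\boldsymbol{\Sigma}_{yx}$ in (3.33) is $p \times q$ and contains the covariance of each $y_i$ with each $x_j$:

$$\boldsymbol{\Sigma}_{yx} = \begin{pmatrix} \sigma_{y_1x_1} & \sigma_{y_1x_2} & \cdots & \sigma_{y_1x_q} \\ \sigma_{y_2x_1} & \sigma_{y_2x_2} & \cdots & \sigma_{y_2x_q} \\ \vdots & \vdots & & \vdots \\ \sigma_{y_px_1} & \sigma_{y_px_2} & \cdots & \sigma_{y_px_q} \end{pmatrix}.$$

Thus $\boldsymbol{\Sigma}_{yx}$ is rectangular unless $p = q$. The covariance matrix $\boldsymbol{\Sigma}_{yx}$ is also denoted
by $\mathrm{cov}(\mathbf{y}, \mathbf{x})$ and can be defined as

$$\boldsymbol{\Sigma}_{yx} = \mathrm{cov}(\mathbf{y}, \mathbf{x}) = E[(\mathbf{y} - \boldsymbol{\mu}_y)(\mathbf{x} - \boldsymbol{\mu}_x)']. \tag{3.34}$$

Note the difference in meaning between $\mathrm{cov}\begin{pmatrix} \mathbf{y} \\ \mathbf{x} \end{pmatrix} = \boldsymbol{\Sigma}$ in (3.33) and $\mathrm{cov}(\mathbf{y}, \mathbf{x}) = \boldsymbol{\Sigma}_{yx}$
in (3.34). We have now used the notation cov in three ways: (1) $\mathrm{cov}(y_i, y_j)$, (2)
$\mathrm{cov}(\mathbf{y})$, and (3) $\mathrm{cov}(\mathbf{y}, \mathbf{x})$. The first of these is a scalar, the second is a symmetric
(usually positive definite) matrix, and the third is a rectangular matrix.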


3.6 LINEAR FUNCTIONS OF RANDOM VECTORS 


We often use linear combinations of the variables $y_1, y_2, \ldots, y_p$ from a random
vector y. Let $\mathbf{a} = (a_1, a_2, \ldots, a_p)'$ be a vector of constants. Then, by an expression
preceding (2.18), the linear combination using the a terms as coefficients can be
written as

$$z = a_1y_1 + a_2y_2 + \cdots + a_py_p = \mathbf{a}'\mathbf{y}. \tag{3.35}$$


We consider the means, variances, and covariances of such linear combinations in 
Sections 3.6.1 and 3.6.2. 




3.6.1 Means 
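Since y is a random vector, the linear combination $z = \mathbf{a}'\mathbf{y}$ is a (univariate) random
variable. The mean of $\mathbf{a}'\mathbf{y}$ is given in the following theorem.


Theorem 3.6a. If a is a $p \times 1$ vector of constants and y is a $p \times 1$ random vector
with mean vector $\boldsymbol{\mu}$, then

$$\mu_z = E(\mathbf{a}'\mathbf{y}) = \mathbf{a}'E(\mathbf{y}) = \mathbf{a}'\boldsymbol{\mu}. \tag{3.36}$$


PROOF. Using (3.4), (3.5), and (3.35), we obtain

$$E(\mathbf{a}'\mathbf{y}) = E(a_1y_1 + a_2y_2 + \cdots + a_py_p)$$
$$= E(a_1y_1) + E(a_2y_2) + \cdots + E(a_py_p)$$
$$= a_1E(y_1) + a_2E(y_2) + \cdots + a_pE(y_p)$$
$$= (a_1, a_2, \ldots, a_p)\begin{pmatrix} E(y_1) \\ E(y_2) \\ \vdots \\ E(y_p) \end{pmatrix}$$
$$= \mathbf{a}'E(\mathbf{y}) = \mathbf{a}'\boldsymbol{\mu}.$$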


Suppose that we have several linear combinations of y with constant coefficients: 


/ 
Z1 = 411 + a2y2 +... + AipYp = AY 

/ 
22 = 4211 + A22Y2 +... + G2pYp = AY 


Zk = Aer Y1 + Agay2 +... + AkpYp = ALY, 


where a) = (ai, 42, ... , ip) and y = (y1, ya, --- 5¥p) These k linear functions can 
be written in the form 


z= Ay, (3.37) 


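where

$$
\mathbf{z} = \begin{pmatrix} z_1 \\ z_2 \\ \vdots \\ z_k \end{pmatrix}, \qquad
\mathbf{A} = \begin{pmatrix} \mathbf{a}_1' \\ \mathbf{a}_2' \\ \vdots \\ \mathbf{a}_k' \end{pmatrix}
= \begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1p} \\
a_{21} & a_{22} & \cdots & a_{2p} \\
\vdots & \vdots & & \vdots \\
a_{k1} & a_{k2} & \cdots & a_{kp}
\end{pmatrix}.
$$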


It is possible to have $k > p$, but we typically have $k \le p$ with the rows of $\mathbf{A}$ linearly independent, so that $\mathbf{A}$ is full-rank. Since $\mathbf{y}$ is a random vector, each $z_i = \mathbf{a}_i'\mathbf{y}$ is a random variable and $\mathbf{z} = (z_1, z_2, \ldots, z_k)'$ is a random vector. The expected value of $\mathbf{z} = \mathbf{A}\mathbf{y}$ is given in the following theorem, as well as some extensions.


Theorem 3.6b. Suppose that y is a random vector, X is a random matrix, a and b are 
vectors of constants, and A and B are matrices of constants. Then, assuming the 
matrices and vectors in each product are conformal, we have the following expected 
values: 


(i) E(Ay) = AE(y). (3.38) 
(ii) E(a'Xb) = a'E(X)b. (3.39) 
(iii) E(AXB) = AE(X)B. (3.40) 


Proof. These results follow from Theorem 3.6a (see Problem 3.14).


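Corollary 1. If $\mathbf{A}$ is a $k \times p$ matrix of constants, $\mathbf{b}$ is a $k \times 1$ vector of constants, and $\mathbf{y}$ is a $p \times 1$ random vector, then

$$E(\mathbf{A}\mathbf{y} + \mathbf{b}) = \mathbf{A}E(\mathbf{y}) + \mathbf{b}. \tag{3.41}$$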


3.6.2 Variances and Covariances 


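The variance of the random variable $z = \mathbf{a}'\mathbf{y}$ is given in the following theorem.


Theorem 3.6c. If $\mathbf{a}$ is a $p \times 1$ vector of constants and $\mathbf{y}$ is a $p \times 1$ random vector with covariance matrix $\boldsymbol{\Sigma}$, then the variance of $z = \mathbf{a}'\mathbf{y}$ is given by

$$\sigma_z^2 = \operatorname{var}(\mathbf{a}'\mathbf{y}) = \mathbf{a}'\boldsymbol{\Sigma}\mathbf{a}. \tag{3.42}$$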


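Proof. By (3.6) and Theorem 3.6a, we obtain

$$
\begin{aligned}
\operatorname{var}(\mathbf{a}'\mathbf{y}) &= E(\mathbf{a}'\mathbf{y} - \mathbf{a}'\boldsymbol{\mu})^2 = E[\mathbf{a}'(\mathbf{y} - \boldsymbol{\mu})]^2 \\
&= E[\mathbf{a}'(\mathbf{y} - \boldsymbol{\mu})\,\mathbf{a}'(\mathbf{y} - \boldsymbol{\mu})] \\
&= E[\mathbf{a}'(\mathbf{y} - \boldsymbol{\mu})(\mathbf{y} - \boldsymbol{\mu})'\mathbf{a}] && [\text{by (2.18)}] \\
&= \mathbf{a}'E[(\mathbf{y} - \boldsymbol{\mu})(\mathbf{y} - \boldsymbol{\mu})']\mathbf{a} && [\text{by Theorem 3.6b(ii)}] \\
&= \mathbf{a}'\boldsymbol{\Sigma}\mathbf{a} && [\text{by (3.24)}].
\end{aligned}
$$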


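We illustrate (3.42) for $p = 3$:

$$
\begin{aligned}
\operatorname{var}(\mathbf{a}'\mathbf{y}) &= \operatorname{var}(a_1y_1 + a_2y_2 + a_3y_3) = \mathbf{a}'\boldsymbol{\Sigma}\mathbf{a} \\
&= a_1^2\sigma_1^2 + a_2^2\sigma_2^2 + a_3^2\sigma_3^2 + 2a_1a_2\sigma_{12} + 2a_1a_3\sigma_{13} + 2a_2a_3\sigma_{23}.
\end{aligned}
$$

Thus, $\operatorname{var}(\mathbf{a}'\mathbf{y}) = \mathbf{a}'\boldsymbol{\Sigma}\mathbf{a}$ involves all the variances and covariances of $y_1$, $y_2$, and $y_3$.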
The covariance of two linear combinations is given in the following corollary to 
Theorem 3.6c. 


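Corollary 1. If $\mathbf{a}$ and $\mathbf{b}$ are $p \times 1$ vectors of constants, then

$$\operatorname{cov}(\mathbf{a}'\mathbf{y}, \mathbf{b}'\mathbf{y}) = \mathbf{a}'\boldsymbol{\Sigma}\mathbf{b}. \tag{3.43}$$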
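The formulas (3.42) and (3.43) can be checked numerically. The following is a minimal sketch in plain Python; the covariance matrix `S` and the coefficient vectors `a` and `b` are illustrative values chosen here, not taken from the text, and `quad` is a hypothetical helper name.

```python
def quad(u, S, v):
    """Return the bilinear form u' S v for vectors u, v and a square matrix S."""
    return sum(u[i] * S[i][j] * v[j] for i in range(len(u)) for j in range(len(v)))

# Illustrative (assumed) 3 x 3 covariance matrix and coefficient vectors.
S = [[2, 1, 0],
     [1, 3, 1],
     [0, 1, 4]]
a = [1, 2, -1]
b = [0, 1, 1]

# var(a'y) = a' S a, as in (3.42)
var_z = quad(a, S, a)

# The same variance written out term by term for p = 3
expanded = (a[0] ** 2 * S[0][0] + a[1] ** 2 * S[1][1] + a[2] ** 2 * S[2][2]
            + 2 * a[0] * a[1] * S[0][1]
            + 2 * a[0] * a[2] * S[0][2]
            + 2 * a[1] * a[2] * S[1][2])

# cov(a'y, b'y) = a' S b, as in (3.43)
cov_ab = quad(a, S, b)

print(var_z, expanded, cov_ab)
```

Because `S` is symmetric, `quad(a, S, b)` and `quad(b, S, a)` agree, which mirrors cov(a'y, b'y) = cov(b'y, a'y).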


Each variable $z_i$ in the random vector $\mathbf{z} = (z_1, z_2, \ldots, z_k)' = \mathbf{A}\mathbf{y}$ in (3.37) has a variance, and each pair $z_i$ and $z_j$ ($i \ne j$) has a covariance. These variances and covariances are found in the covariance matrix for $\mathbf{z}$, which is given in the following theorem, along with $\operatorname{cov}(\mathbf{z}, \mathbf{w})$, where $\mathbf{w} = \mathbf{B}\mathbf{y}$ is another set of linear functions.


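Theorem 3.6d. Let $\mathbf{z} = \mathbf{A}\mathbf{y}$ and $\mathbf{w} = \mathbf{B}\mathbf{y}$, where $\mathbf{A}$ is a $k \times p$ matrix of constants, $\mathbf{B}$ is an $m \times p$ matrix of constants, and $\mathbf{y}$ is a $p \times 1$ random vector with covariance matrix $\boldsymbol{\Sigma}$. Then

(i) $\operatorname{cov}(\mathbf{z}) = \operatorname{cov}(\mathbf{A}\mathbf{y}) = \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}'$, (3.44)

(ii) $\operatorname{cov}(\mathbf{z}, \mathbf{w}) = \operatorname{cov}(\mathbf{A}\mathbf{y}, \mathbf{B}\mathbf{y}) = \mathbf{A}\boldsymbol{\Sigma}\mathbf{B}'$. (3.45)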


Typically, $k \le p$, and the $k \times p$ matrix $\mathbf{A}$ is full rank, in which case, by Corollary 1 to Theorem 2.6b, $\mathbf{A}\boldsymbol{\Sigma}\mathbf{A}'$ is positive definite (assuming $\boldsymbol{\Sigma}$ to be positive definite). If $k > p$, then by Corollary 2 to Theorem 2.6b, $\mathbf{A}\boldsymbol{\Sigma}\mathbf{A}'$ is positive semidefinite. In this case, $\mathbf{A}\boldsymbol{\Sigma}\mathbf{A}'$ is still a covariance matrix, but it cannot be used in either the numerator or denominator of the multivariate normal density given in (4.9) in Chapter 4.

Note that $\operatorname{cov}(\mathbf{z}, \mathbf{w}) = \mathbf{A}\boldsymbol{\Sigma}\mathbf{B}'$ is a $k \times m$ rectangular matrix containing the covariance of each $z_i$ with each $w_j$, that is, $\operatorname{cov}(\mathbf{z}, \mathbf{w})$ contains $\operatorname{cov}(z_i, w_j)$, $i = 1, 2, \ldots, k$, $j = 1, 2, \ldots, m$. These $km$ covariances can also be found individually by (3.43).
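As a hedged illustration of Theorem 3.6d, the sketch below forms cov(Ay) and cov(Ay, By) as matrix products and confirms that each entry of the rectangular matrix agrees with the bilinear form in (3.43). The matrices `S`, `A`, and `B` are assumed values chosen for illustration, and `matmul` and `transpose` are small helper functions written for this example.

```python
def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*X)]

# Illustrative (assumed) covariance matrix of y, with p = 3.
S = [[2, 1, 0],
     [1, 3, 1],
     [0, 1, 4]]
A = [[1, 2, -1],
     [0, 1, 1]]      # k = 2 linear functions z = Ay
B = [[1, 0, 0],
     [1, 1, 1],
     [2, -1, 0]]     # m = 3 linear functions w = By

cov_z = matmul(matmul(A, S), transpose(A))    # (3.44): k x k, symmetric
cov_zw = matmul(matmul(A, S), transpose(B))   # (3.45): k x m, rectangular

# Entry (i, j) of cov(z, w) computed individually as a_i' S b_j, as in (3.43)
individual = [[sum(A[i][r] * S[r][c] * B[j][c] for r in range(3) for c in range(3))
               for j in range(len(B))]
              for i in range(len(A))]

print(cov_z)
print(cov_zw)
```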


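Corollary 1. If $\mathbf{b}$ is a $k \times 1$ vector of constants, then

$$\operatorname{cov}(\mathbf{A}\mathbf{y} + \mathbf{b}) = \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}'. \tag{3.46}$$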


The covariance matrix of linear functions of two different random vectors is given 
in the following theorem. 


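Theorem 3.6e. Let $\mathbf{y}$ be a $p \times 1$ random vector and $\mathbf{x}$ be a $q \times 1$ random vector such that $\operatorname{cov}(\mathbf{y}, \mathbf{x}) = \boldsymbol{\Sigma}_{yx}$. Let $\mathbf{A}$ be a $k \times p$ matrix of constants and $\mathbf{B}$ be an $h \times q$ matrix of constants. Then

$$\operatorname{cov}(\mathbf{A}\mathbf{y}, \mathbf{B}\mathbf{x}) = \mathbf{A}\boldsymbol{\Sigma}_{yx}\mathbf{B}'. \tag{3.47}$$


Proof. Let

$$
\mathbf{v} = \begin{pmatrix} \mathbf{y} \\ \mathbf{x} \end{pmatrix} \qquad \text{and} \qquad
\mathbf{C} = \begin{pmatrix} \mathbf{A} & \mathbf{O} \\ \mathbf{O} & \mathbf{B} \end{pmatrix}.
$$

Use Theorem 3.6d(i) to obtain $\operatorname{cov}(\mathbf{C}\mathbf{v})$. The result follows, since the upper right block of $\operatorname{cov}(\mathbf{C}\mathbf{v})$ is $\operatorname{cov}(\mathbf{A}\mathbf{y}, \mathbf{B}\mathbf{x})$.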
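The block-matrix argument in this proof can be traced numerically. The sketch below is an illustration under assumed values: the partitioned covariance blocks `Syy`, `Syx`, `Sxx` and the matrices `A` and `B` are made up for the example, and `block_diag` is a hypothetical helper that builds C from A and B.

```python
def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def block_diag(A, B):
    """Return the block matrix with A and B on the diagonal and zeros elsewhere."""
    p, q = len(A[0]), len(B[0])
    return [row + [0] * q for row in A] + [[0] * p + row for row in B]

# Assumed blocks of cov(v) for v = (y', x')' with p = q = 2.
Syy = [[2, 1], [1, 3]]
Syx = [[1, 0], [1, -1]]
Sxx = [[4, 1], [1, 2]]
Sxy = transpose(Syx)
Sigma = ([ry + rx for ry, rx in zip(Syy, Syx)]
         + [ry + rx for ry, rx in zip(Sxy, Sxx)])

A = [[1, 1]]             # k x p with k = 1
B = [[1, 0], [2, 1]]     # h x q with h = 2
C = block_diag(A, B)

cov_Cv = matmul(matmul(C, Sigma), transpose(C))   # Theorem 3.6d(i)
k = len(A)
upper_right = [row[k:] for row in cov_Cv[:k]]     # block holding cov(Ay, Bx)
direct = matmul(matmul(A, Syx), transpose(B))     # right side of (3.47)

print(upper_right, direct)
```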


PROBLEMS 


3.1 Show that $E(ay) = aE(y)$ as in (3.4).

3.2 Show that $E(y - \mu)^2 = E(y^2) - \mu^2$ as in (3.8).

3.3 Show that $\operatorname{var}(ay) = a^2\sigma^2$ as in (3.9).

3.4 Show that $\operatorname{cov}(y_i, y_j) = E(y_iy_j) - \mu_i\mu_j$ as in (3.11).

3.5 Show that if $y_i$ and $y_j$ are independent, then $E(y_iy_j) = E(y_i)E(y_j)$ as in (3.14).

3.6 Show that if $y_i$ and $y_j$ are independent, then $\sigma_{ij} = 0$ as in (3.15).

3.7 Establish the following results in Example 3.2:

(a) Show that $f_2(y) = 1 - \sqrt{1 - y}$ for $0 \le y \le 1$ and $f_2(y) = \sqrt{2 - y}$ for $1 \le y \le 2$.

(b) Show that $E(y) = \tfrac{7}{6}$ and $E(xy) = \tfrac{7}{6}$.

(c) Show that $E(y|x) = \tfrac{1}{2}(1 + 4x - 2x^2)$.


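3.8 Suppose the bivariate random variable $(x, y)$ is uniformly distributed over the region bounded below by $y = x - 1$ for $1 \le x \le 2$ and by $y = 3 - x$ for $2 \le x \le 3$ and bounded above by $y = x$ for $1 \le x \le 2$ and by $y = 4 - x$ for $2 \le x \le 3$.

(a) Show that the area of this region is 2, so that $f(x, y) = \tfrac{1}{2}$.

(b) Find $f_1(x)$, $f_2(y)$, $E(x)$, $E(y)$, $E(xy)$, and $\sigma_{xy}$, as was done in Example 3.2. Are $x$ and $y$ independent?

(c) Find $f(y|x)$ and $E(y|x)$.

3.9 Show that $E(\mathbf{x} + \mathbf{y}) = E(\mathbf{x}) + E(\mathbf{y})$ as in (3.21).

3.10 Show that $E[(\mathbf{y} - \boldsymbol{\mu})(\mathbf{y} - \boldsymbol{\mu})'] = E(\mathbf{y}\mathbf{y}') - \boldsymbol{\mu}\boldsymbol{\mu}'$ as in (3.25).

3.11 Show that the standardized distance transforms the variables so that they are uncorrelated and have means equal to 0 and variances equal to 1 as noted following (3.27).

3.12 Illustrate $\mathbf{P}_\rho = \mathbf{D}_\sigma^{-1}\boldsymbol{\Sigma}\mathbf{D}_\sigma^{-1}$ in (3.30) for $p = 3$.

3.13 Using (3.24), show that

$$
\operatorname{cov}(\mathbf{v}) = \operatorname{cov}\begin{pmatrix} \mathbf{y} \\ \mathbf{x} \end{pmatrix}
= \begin{pmatrix} \boldsymbol{\Sigma}_{yy} & \boldsymbol{\Sigma}_{yx} \\ \boldsymbol{\Sigma}_{xy} & \boldsymbol{\Sigma}_{xx} \end{pmatrix}
$$

as in (3.33).

3.14 Prove Theorem 3.6b.

3.15 Prove Corollary 1 to Theorem 3.6b.

3.16 Prove Corollary 1 to Theorem 3.6c.

3.17 Prove Theorem 3.6d.

3.18 Prove Corollary 1 to Theorem 3.6d.

3.19 Consider four $k \times 1$ random vectors $\mathbf{y}$, $\mathbf{x}$, $\mathbf{v}$, and $\mathbf{w}$, and four $h \times k$ constant matrices $\mathbf{A}$, $\mathbf{B}$, $\mathbf{C}$, and $\mathbf{D}$. Find $\operatorname{cov}(\mathbf{A}\mathbf{y} + \mathbf{B}\mathbf{x}, \mathbf{C}\mathbf{v} + \mathbf{D}\mathbf{w})$.

3.20 Let $\mathbf{y} = (y_1, y_2, y_3)'$ be a random vector with mean vector and covariance matrix


(a) Let $z = 2y_1 - 3y_2 + y_3$. Find $E(z)$ and $\operatorname{var}(z)$.

(b) Let $z_1 = y_1 + y_2 + y_3$ and $z_2 = 3y_1 + y_2 - 2y_3$. Find $E(\mathbf{z})$ and $\operatorname{cov}(\mathbf{z})$, where $\mathbf{z} = (z_1, z_2)'$.

3.21 Let $\mathbf{y}$ be a random vector with mean vector and covariance matrix $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ as given in Problem 3.20, and define $\mathbf{w} = (w_1, w_2, w_3)'$ as follows:

$$
\begin{aligned}
w_1 &= 2y_1 - y_2 + y_3 \\
w_2 &= y_1 + 2y_2 - 3y_3 \\
w_3 &= y_1 + y_2 + 2y_3.
\end{aligned}
$$

(a) Find $E(\mathbf{w})$ and $\operatorname{cov}(\mathbf{w})$.

(b) Using $\mathbf{z}$ as defined in Problem 3.20b, find $\operatorname{cov}(\mathbf{z}, \mathbf{w})$.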


4 Multivariate Normal Distribution 


In order to make inferences, we often assume that the random vector of interest has a 
multivariate normal distribution. Before developing the multivariate normal density 
function and its properties, we first review the univariate normal distribution. 


4.1 UNIVARIATE NORMAL DENSITY FUNCTION 


We begin with a standard normal random variable $z$ with mean 0 and variance 1. We then transform $z$ to a random variable $y$ with arbitrary mean $\mu$ and variance $\sigma^2$, and we find the density of $y$ from that of $z$. In Section 4.2, we will follow an analogous procedure to obtain the density of a multivariate normal random vector.

The standard normal density is given by

$$g(z) = \frac{1}{\sqrt{2\pi}}\,e^{-z^2/2}, \qquad -\infty < z < \infty, \tag{4.1}$$

for which $E(z) = 0$ and $\operatorname{var}(z) = 1$. When $z$ has the density (4.1), we say that $z$ is distributed as $N(0, 1)$, or simply that $z$ is $N(0, 1)$.

To obtain a normal random variable $y$ with arbitrary mean $\mu$ and variance $\sigma^2$, we use the transformation $z = (y - \mu)/\sigma$ or $y = \sigma z + \mu$, so that $E(y) = \mu$ and $\operatorname{var}(y) = \sigma^2$. We now find the density $f(y)$ from $g(z)$ in (4.1). For a continuous increasing function (such as $y = \sigma z + \mu$) or for a continuous decreasing function, the change-of-variable technique for a definite integral gives

$$f(y) = g(z)\left|\frac{dz}{dy}\right|, \tag{4.2}$$

where $|dz/dy|$ is the absolute value of $dz/dy$ (Hogg and Craig 1995, p. 169). To use (4.2) to find the density of $y$, it is clear that both $z$ and $dz/dy$ on the right side must be expressed in terms of $y$.




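Let us apply (4.2) to $y = \sigma z + \mu$. The density $g(z)$ is given in (4.1), and for $z = (y - \mu)/\sigma$, we have $|dz/dy| = 1/\sigma$. Thus

$$f(y) = g(z)\left|\frac{dz}{dy}\right| = g\!\left(\frac{y - \mu}{\sigma}\right)\frac{1}{\sigma}
= \frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-(y - \mu)^2/2\sigma^2}, \tag{4.3}$$

which is the normal density with $E(y) = \mu$ and $\operatorname{var}(y) = \sigma^2$. When $y$ has the density (4.3), we say that $y$ is distributed as $N(\mu, \sigma^2)$ or simply that $y$ is $N(\mu, \sigma^2)$.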

In Section 4.2, we use a multivariate extension of this technique to find the multi- 
variate normal density function. 
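The change-of-variable step in (4.2) and (4.3) can be written directly as code. This is a minimal sketch using only the Python standard library; the function names `g` and `f` follow the notation of the text, while the numerical values of `mu`, `sigma`, and the evaluation point are arbitrary choices made for illustration. The result is compared with `statistics.NormalDist`, which implements the same density.

```python
import math
from statistics import NormalDist

def g(z):
    """Standard normal density, as in (4.1)."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def f(y, mu, sigma):
    """Density of y = sigma*z + mu obtained from g by (4.2): f(y) = g(z)|dz/dy|."""
    z = (y - mu) / sigma          # express z in terms of y
    dz_dy = 1 / sigma             # derivative of z with respect to y
    return g(z) * abs(dz_dy)

# Arbitrary illustrative values.
value = f(1.5, mu=1.0, sigma=2.0)
reference = NormalDist(mu=1.0, sigma=2.0).pdf(1.5)
print(value, reference)
```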


4.2 MULTIVARIATE NORMAL DENSITY FUNCTION 


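We begin with independent standard normal random variables $z_1, z_2, \ldots, z_p$, with $\mu_i = 0$ and $\sigma_i^2 = 1$ for all $i$ and $\sigma_{ij} = 0$ for $i \ne j$, and we then transform the $z_i$'s to multivariate normal variables $y_1, y_2, \ldots, y_p$, with arbitrary means, variances, and covariances. We thus start with a random vector $\mathbf{z} = (z_1, z_2, \ldots, z_p)'$, where $E(\mathbf{z}) = \mathbf{0}$, $\operatorname{cov}(\mathbf{z}) = \mathbf{I}$, and each $z_i$ has the $N(0, 1)$ density in (4.1). We wish to transform $\mathbf{z}$ to a multivariate normal random vector $\mathbf{y} = (y_1, y_2, \ldots, y_p)'$ with $E(\mathbf{y}) = \boldsymbol{\mu}$ and $\operatorname{cov}(\mathbf{y}) = \boldsymbol{\Sigma}$, where $\boldsymbol{\mu}$ is any $p \times 1$ vector and $\boldsymbol{\Sigma}$ is any $p \times p$ positive definite matrix.

By (4.1) and an extension of (3.12), we have

$$
\begin{aligned}
g(z_1, z_2, \ldots, z_p) = g(\mathbf{z}) &= g_1(z_1)g_2(z_2)\cdots g_p(z_p) \\
&= \frac{1}{\sqrt{2\pi}}e^{-z_1^2/2}\,\frac{1}{\sqrt{2\pi}}e^{-z_2^2/2}\cdots\frac{1}{\sqrt{2\pi}}e^{-z_p^2/2} \\
&= \frac{1}{(\sqrt{2\pi})^p}\,e^{-\sum_i z_i^2/2} \\
&= \frac{1}{(\sqrt{2\pi})^p}\,e^{-\mathbf{z}'\mathbf{z}/2} \qquad [\text{by (2.20)}].
\end{aligned} \tag{4.4}
$$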


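If $\mathbf{z}$ has the density (4.4), we say that $\mathbf{z}$ has a multivariate normal density with mean vector $\mathbf{0}$ and covariance matrix $\mathbf{I}$ or simply that $\mathbf{z}$ is distributed as $N_p(\mathbf{0}, \mathbf{I})$, where $p$ is the dimension of the distribution and corresponds to the number of variables in $\mathbf{y}$.

To transform $\mathbf{z}$ to $\mathbf{y}$ with arbitrary mean vector $E(\mathbf{y}) = \boldsymbol{\mu}$ and arbitrary (positive definite) covariance matrix $\operatorname{cov}(\mathbf{y}) = \boldsymbol{\Sigma}$, we define the transformation

$$\mathbf{y} = \boldsymbol{\Sigma}^{1/2}\mathbf{z} + \boldsymbol{\mu}, \tag{4.5}$$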


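where $\boldsymbol{\Sigma}^{1/2}$ is the (symmetric) square root matrix defined in (2.109). By (3.41) and (3.46), we obtain

$$
\begin{aligned}
E(\mathbf{y}) &= E(\boldsymbol{\Sigma}^{1/2}\mathbf{z} + \boldsymbol{\mu}) = \boldsymbol{\Sigma}^{1/2}E(\mathbf{z}) + \boldsymbol{\mu} = \boldsymbol{\Sigma}^{1/2}\mathbf{0} + \boldsymbol{\mu} = \boldsymbol{\mu}, \\
\operatorname{cov}(\mathbf{y}) &= \operatorname{cov}(\boldsymbol{\Sigma}^{1/2}\mathbf{z} + \boldsymbol{\mu}) = \boldsymbol{\Sigma}^{1/2}\operatorname{cov}(\mathbf{z})(\boldsymbol{\Sigma}^{1/2})' = \boldsymbol{\Sigma}^{1/2}\,\mathbf{I}\,\boldsymbol{\Sigma}^{1/2} = \boldsymbol{\Sigma}.
\end{aligned}
$$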


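Note the analogy of (4.5) to $y = \sigma z + \mu$ in Section 4.1.

Let us now find the density of $\mathbf{y} = \boldsymbol{\Sigma}^{1/2}\mathbf{z} + \boldsymbol{\mu}$ from the density of $\mathbf{z}$ in (4.4). By (4.2), the density of $y = \sigma z + \mu$ is $f(y) = g(z)|dz/dy| = g(z)|1/\sigma|$. The analogous expression for the multivariate linear transformation $\mathbf{y} = \boldsymbol{\Sigma}^{1/2}\mathbf{z} + \boldsymbol{\mu}$ is

$$f(\mathbf{y}) = g(\mathbf{z})\,\operatorname{abs}(|\boldsymbol{\Sigma}^{-1/2}|), \tag{4.6}$$

where $\boldsymbol{\Sigma}^{-1/2}$ is defined as $(\boldsymbol{\Sigma}^{1/2})^{-1}$ and $\operatorname{abs}(|\boldsymbol{\Sigma}^{-1/2}|)$ represents the absolute value of the determinant of $\boldsymbol{\Sigma}^{-1/2}$, which parallels the absolute value expression $|dz/dy| = |1/\sigma|$ in the univariate case. (The determinant $|\boldsymbol{\Sigma}^{-1/2}|$ is the Jacobian of the transformation; see any advanced calculus text.) Since $\boldsymbol{\Sigma}^{-1/2}$ is positive definite, we can dispense with the absolute value and write (4.6) as

\begin{align}
f(\mathbf{y}) &= g(\mathbf{z})\,|\boldsymbol{\Sigma}^{-1/2}| \tag{4.7} \\
&= g(\mathbf{z})\,|\boldsymbol{\Sigma}|^{-1/2} \qquad [\text{by (2.67)}]. \tag{4.8}
\end{align}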


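In order to express $\mathbf{z}$ in terms of $\mathbf{y}$, we use (4.5) to obtain $\mathbf{z} = \boldsymbol{\Sigma}^{-1/2}(\mathbf{y} - \boldsymbol{\mu})$. Then using (4.4) and (4.8), we can write the density of $\mathbf{y}$ as

$$
\begin{aligned}
f(\mathbf{y}) = g(\mathbf{z})\,|\boldsymbol{\Sigma}|^{-1/2}
&= \frac{1}{(\sqrt{2\pi})^p|\boldsymbol{\Sigma}|^{1/2}}\,e^{-\mathbf{z}'\mathbf{z}/2} \\
&= \frac{1}{(\sqrt{2\pi})^p|\boldsymbol{\Sigma}|^{1/2}}\,e^{-[\boldsymbol{\Sigma}^{-1/2}(\mathbf{y} - \boldsymbol{\mu})]'[\boldsymbol{\Sigma}^{-1/2}(\mathbf{y} - \boldsymbol{\mu})]/2} \\
&= \frac{1}{(\sqrt{2\pi})^p|\boldsymbol{\Sigma}|^{1/2}}\,e^{-(\mathbf{y} - \boldsymbol{\mu})'(\boldsymbol{\Sigma}^{1/2}\boldsymbol{\Sigma}^{1/2})^{-1}(\mathbf{y} - \boldsymbol{\mu})/2} \\
&= \frac{1}{(\sqrt{2\pi})^p|\boldsymbol{\Sigma}|^{1/2}}\,e^{-(\mathbf{y} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{y} - \boldsymbol{\mu})/2},
\end{aligned} \tag{4.9}
$$

which is the multivariate normal density function with mean vector $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$. When $\mathbf{y}$ has the density (4.9), we say that $\mathbf{y}$ is distributed as $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ or simply that $\mathbf{y}$ is $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. The subscript $p$ is the dimension of the $p$-variate normal distribution and indicates the number of variables, that is, $\mathbf{y}$ is $p \times 1$, $\boldsymbol{\mu}$ is $p \times 1$, and $\boldsymbol{\Sigma}$ is $p \times p$.

A comparison of (4.9) and (4.3) shows the standardized distance $(\mathbf{y} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{y} - \boldsymbol{\mu})$ in place of $(y - \mu)^2/\sigma^2$ in the exponent and the square root of the generalized variance $|\boldsymbol{\Sigma}|$ in place of the square root of $\sigma^2$ in the denominator. [For standardized distance, see (3.27), and for generalized variance, see (3.26).] These distance and variance functions serve analogous purposes in the densities (4.9) and (4.3). In (4.9), $f(\mathbf{y})$ decreases as the distance from $\mathbf{y}$ to $\boldsymbol{\mu}$ increases, and a small value of $|\boldsymbol{\Sigma}|$ indicates that the $y$'s are concentrated closer to $\boldsymbol{\mu}$ than is the case when $|\boldsymbol{\Sigma}|$ is large. A small value of $|\boldsymbol{\Sigma}|$ may also indicate a high degree of multicollinearity among the variables. High multicollinearity indicates that the variables are highly intercorrelated, in which case the $y$'s tend to occupy a subspace of the $p$ dimensions.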
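To make the roles of the standardized distance and the generalized variance in (4.9) concrete, here is a small sketch restricted to p = 2, where the inverse and determinant have closed forms. The function name `mvn_pdf_2d` and all numerical values are assumptions made for this illustration. With a diagonal covariance matrix the joint density should equal the product of two univariate densities of the form (4.3), which the code checks against `statistics.NormalDist`.

```python
import math
from statistics import NormalDist

def mvn_pdf_2d(y, mu, S):
    """Bivariate normal density (4.9) for p = 2, using the closed-form 2 x 2 inverse."""
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]            # generalized variance |Sigma|
    inv = [[S[1][1] / det, -S[0][1] / det],
           [-S[1][0] / det, S[0][0] / det]]
    d = [y[0] - mu[0], y[1] - mu[1]]
    # standardized distance (y - mu)' Sigma^{-1} (y - mu)
    dist2 = sum(d[i] * inv[i][j] * d[j] for i in range(2) for j in range(2))
    return math.exp(-dist2 / 2) / ((2 * math.pi) * math.sqrt(det))

# Diagonal case: joint density factors into two univariate normal densities.
mu = [1.0, -1.0]
S_diag = [[4.0, 0.0], [0.0, 9.0]]
joint = mvn_pdf_2d([2.0, 0.5], mu, S_diag)
product = NormalDist(1.0, 2.0).pdf(2.0) * NormalDist(-1.0, 3.0).pdf(0.5)

# Correlated case: the density is largest at the mean and falls off with distance.
S_corr = [[4.0, 2.0], [2.0, 3.0]]
at_mean = mvn_pdf_2d(mu, mu, S_corr)
away = mvn_pdf_2d([3.0, 2.0], mu, S_corr)
print(joint, product, at_mean, away)
```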


4.3 MOMENT GENERATING FUNCTIONS


We now review moment generating functions, which can be used to obtain some of 
the properties of multivariate normal random variables. We begin with the univariate 
case. 

The moment generating function for a univariate random variable $y$ is defined as

$$M_y(t) = E(e^{ty}), \tag{4.10}$$

provided $E(e^{ty})$ exists for every real number $t$ in the neighborhood $-h < t < h$ for some positive number $h$. For the univariate normal $N(\mu, \sigma^2)$, the moment generating function of $y$ is given by

$$M_y(t) = e^{t\mu + t^2\sigma^2/2}. \tag{4.11}$$


Moment generating functions characterize a distribution in some important ways that prove very useful (see the two properties at the end of this section). As their name implies, moment generating functions can also be used to generate moments. We now demonstrate this. For a continuous random variable $y$, the moment generating function can be written as $M_y(t) = E(e^{ty}) = \int_{-\infty}^{\infty} e^{ty}f(y)\,dy$. Then, provided we can interchange the order of integration and differentiation, we have

$$\frac{dM_y(t)}{dt} = M_y'(t) = \int_{-\infty}^{\infty} y\,e^{ty}f(y)\,dy. \tag{4.12}$$

Setting $t = 0$ gives the first moment or mean:

$$M_y'(0) = \int_{-\infty}^{\infty} y\,f(y)\,dy = E(y). \tag{4.13}$$

Similarly, the $k$th moment can be obtained using the $k$th derivative evaluated at 0:

$$M_y^{(k)}(0) = E(y^k). \tag{4.14}$$

The second moment, $E(y^2)$, can be used to find the variance [see (3.8)].
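The relationship between derivatives of the moment generating function and moments can be illustrated numerically. The sketch below approximates the first and second derivatives of the normal moment generating function (4.11) at t = 0 by central finite differences; the step size `h`, the helper names, and the parameter values are assumptions of this example, and the results are approximate rather than exact.

```python
import math

def mgf_normal(t, mu, sigma):
    """Moment generating function of N(mu, sigma^2), as in (4.11)."""
    return math.exp(t * mu + t * t * sigma * sigma / 2)

def derivative_at_zero(M, order, h=1e-3):
    """Central finite-difference approximation to M'(0) or M''(0)."""
    if order == 1:
        return (M(h) - M(-h)) / (2 * h)
    if order == 2:
        return (M(h) - 2 * M(0.0) + M(-h)) / (h * h)
    raise ValueError("only orders 1 and 2 are implemented in this sketch")

mu, sigma = 1.5, 2.0                       # arbitrary illustrative values
M = lambda t: mgf_normal(t, mu, sigma)

mean = derivative_at_zero(M, 1)            # approximates E(y), as in (4.13)
second_moment = derivative_at_zero(M, 2)   # approximates E(y^2), as in (4.14)
variance = second_moment - mean ** 2       # E(y^2) - mu^2, as in (3.8)
print(mean, second_moment, variance)
```

The printed values should be close to 1.5, 6.25, and 4, that is, to the mean, to the second moment, and to the variance of the assumed distribution.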
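For a random vector $\mathbf{y}$, the moment generating function is defined as

$$M_{\mathbf{y}}(\mathbf{t}) = E(e^{t_1y_1 + t_2y_2 + \cdots + t_py_p}) = E(e^{\mathbf{t}'\mathbf{y}}). \tag{4.15}$$

By analogy with (4.13), we have

$$\frac{\partial M_{\mathbf{y}}(\mathbf{0})}{\partial \mathbf{t}} = E(\mathbf{y}), \tag{4.16}$$

where the notation $\partial M_{\mathbf{y}}(\mathbf{0})/\partial \mathbf{t}$ indicates that $\partial M_{\mathbf{y}}(\mathbf{t})/\partial \mathbf{t}$ is evaluated at $\mathbf{t} = \mathbf{0}$. Similarly, $\partial^2 M_{\mathbf{y}}(\mathbf{t})/\partial t_r\,\partial t_s$ evaluated at $t_r = t_s = 0$ gives $E(y_ry_s)$:

$$\frac{\partial^2 M_{\mathbf{y}}(\mathbf{0})}{\partial t_r\,\partial t_s} = E(y_ry_s). \tag{4.17}$$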


For a multivariate normal random vector y, the moment generating function is 
given in the following theorem. 


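Theorem 4.3. If $\mathbf{y}$ is distributed as $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, its moment generating function is given by

$$M_{\mathbf{y}}(\mathbf{t}) = e^{\mathbf{t}'\boldsymbol{\mu} + \mathbf{t}'\boldsymbol{\Sigma}\mathbf{t}/2}. \tag{4.18}$$


Proof. By (4.15) and (4.9), the moment generating function is

$$M_{\mathbf{y}}(\mathbf{t}) = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} k\,e^{\mathbf{t}'\mathbf{y} - (\mathbf{y} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{y} - \boldsymbol{\mu})/2}\,d\mathbf{y}, \tag{4.19}$$

where $k = 1/[(\sqrt{2\pi})^p|\boldsymbol{\Sigma}|^{1/2}]$ and $d\mathbf{y} = dy_1\,dy_2\cdots dy_p$. By rewriting the exponent, we obtain

\begin{align}
M_{\mathbf{y}}(\mathbf{t}) &= \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} k\,e^{\mathbf{t}'\boldsymbol{\mu} + \mathbf{t}'\boldsymbol{\Sigma}\mathbf{t}/2 - (\mathbf{y} - \boldsymbol{\mu} - \boldsymbol{\Sigma}\mathbf{t})'\boldsymbol{\Sigma}^{-1}(\mathbf{y} - \boldsymbol{\mu} - \boldsymbol{\Sigma}\mathbf{t})/2}\,d\mathbf{y} \tag{4.20} \\
&= e^{\mathbf{t}'\boldsymbol{\mu} + \mathbf{t}'\boldsymbol{\Sigma}\mathbf{t}/2}\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} k\,e^{-[\mathbf{y} - (\boldsymbol{\mu} + \boldsymbol{\Sigma}\mathbf{t})]'\boldsymbol{\Sigma}^{-1}[\mathbf{y} - (\boldsymbol{\mu} + \boldsymbol{\Sigma}\mathbf{t})]/2}\,d\mathbf{y} \tag{4.21} \\
&= e^{\mathbf{t}'\boldsymbol{\mu} + \mathbf{t}'\boldsymbol{\Sigma}\mathbf{t}/2}. \notag
\end{align}

The multiple integral in (4.21) is equal to 1 because the multivariate normal density in (4.9) integrates to 1 for any mean vector, including $\boldsymbol{\mu} + \boldsymbol{\Sigma}\mathbf{t}$.


Corollary 1. The moment generating function for $\mathbf{y} - \boldsymbol{\mu}$ is

$$M_{\mathbf{y} - \boldsymbol{\mu}}(\mathbf{t}) = e^{\mathbf{t}'\boldsymbol{\Sigma}\mathbf{t}/2}. \tag{4.22}$$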


We now list two important properties of moment generating functions. 


1. If two random vectors have the same moment generating function, they have the same density.

2. Two random vectors are independent if and only if their joint moment generating function factors into the product of their two separate moment generating functions; that is, if $\mathbf{y}' = (\mathbf{y}_1', \mathbf{y}_2')$ and $\mathbf{t}' = (\mathbf{t}_1', \mathbf{t}_2')$, then $\mathbf{y}_1$ and $\mathbf{y}_2$ are independent if and only if

$$M_{\mathbf{y}}(\mathbf{t}) = M_{\mathbf{y}_1}(\mathbf{t}_1)\,M_{\mathbf{y}_2}(\mathbf{t}_2). \tag{4.23}$$


4.4 PROPERTIES OF THE MULTIVARIATE NORMAL 
DISTRIBUTION 


We first consider the distribution of linear functions of multivariate normal random 
variables. 


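Theorem 4.4a. Let the $p \times 1$ random vector $\mathbf{y}$ be $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, let $\mathbf{a}$ be any $p \times 1$ vector of constants, and let $\mathbf{A}$ be any $k \times p$ matrix of constants with rank $k \le p$. Then

(i) $z = \mathbf{a}'\mathbf{y}$ is $N(\mathbf{a}'\boldsymbol{\mu}, \mathbf{a}'\boldsymbol{\Sigma}\mathbf{a})$

(ii) $\mathbf{z} = \mathbf{A}\mathbf{y}$ is $N_k(\mathbf{A}\boldsymbol{\mu}, \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}')$.

Proof

(i) The moment generating function for $z = \mathbf{a}'\mathbf{y}$ is given by

$$
\begin{aligned}
M_z(t) = E(e^{tz}) &= E(e^{t\mathbf{a}'\mathbf{y}}) = E(e^{(t\mathbf{a})'\mathbf{y}}) \\
&= e^{(t\mathbf{a})'\boldsymbol{\mu} + (t\mathbf{a})'\boldsymbol{\Sigma}(t\mathbf{a})/2} \qquad [\text{by (4.18)}] \\
&= e^{(\mathbf{a}'\boldsymbol{\mu})t + (\mathbf{a}'\boldsymbol{\Sigma}\mathbf{a})t^2/2}.
\end{aligned} \tag{4.24}
$$

On comparing (4.24) with (4.11), it is clear that $z = \mathbf{a}'\mathbf{y}$ is univariate normal with mean $\mathbf{a}'\boldsymbol{\mu}$ and variance $\mathbf{a}'\boldsymbol{\Sigma}\mathbf{a}$.

(ii) The moment generating function for $\mathbf{z} = \mathbf{A}\mathbf{y}$ is given by

$$M_{\mathbf{z}}(\mathbf{t}) = E(e^{\mathbf{t}'\mathbf{z}}) = E(e^{\mathbf{t}'\mathbf{A}\mathbf{y}}),$$

which becomes

$$M_{\mathbf{z}}(\mathbf{t}) = e^{\mathbf{t}'(\mathbf{A}\boldsymbol{\mu}) + \mathbf{t}'(\mathbf{A}\boldsymbol{\Sigma}\mathbf{A}')\mathbf{t}/2} \tag{4.25}$$

(see Problem 4.7). By Corollary 1 to Theorem 2.6b, the covariance matrix $\mathbf{A}\boldsymbol{\Sigma}\mathbf{A}'$ is positive definite. Thus, by (4.18) and (4.25), the $k \times 1$ random vector $\mathbf{z} = \mathbf{A}\mathbf{y}$ is distributed as the $k$-variate normal $N_k(\mathbf{A}\boldsymbol{\mu}, \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}')$.


Corollary 1. If $\mathbf{b}$ is any $k \times 1$ vector of constants, then

$$\mathbf{z} = \mathbf{A}\mathbf{y} + \mathbf{b} \quad \text{is} \quad N_k(\mathbf{A}\boldsymbol{\mu} + \mathbf{b}, \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}').$$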


The marginal distributions of multivariate normal variables are also normal, as 
shown in the following theorem. 


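Theorem 4.4b. If $\mathbf{y}$ is $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, then any $r \times 1$ subvector of $\mathbf{y}$ has an $r$-variate normal distribution with the same means, variances, and covariances as in the original $p$-variate normal distribution.


Proof. Without loss of generality, let $\mathbf{y}$ be partitioned as $\mathbf{y}' = (\mathbf{y}_1', \mathbf{y}_2')$, where $\mathbf{y}_1$ is the $r \times 1$ subvector of interest. Let $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ be partitioned accordingly:

$$
\mathbf{y} = \begin{pmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \end{pmatrix}, \qquad
\boldsymbol{\mu} = \begin{pmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{pmatrix}, \qquad
\boldsymbol{\Sigma} = \begin{pmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22} \end{pmatrix}.
$$

Define $\mathbf{A} = (\mathbf{I}_r, \mathbf{O})$, where $\mathbf{I}_r$ is an $r \times r$ identity matrix and $\mathbf{O}$ is an $r \times (p - r)$ matrix of 0s. Then $\mathbf{A}\mathbf{y} = \mathbf{y}_1$, and by Theorem 4.4a(ii), $\mathbf{y}_1$ is distributed as $N_r(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_{11})$.


Corollary 1. If $\mathbf{y}$ is $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, then any individual variable $y_i$ in $\mathbf{y}$ is distributed as $N(\mu_i, \sigma_{ii})$.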

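For the next two theorems, we use the notation of Section 3.5, in which the random vector $\mathbf{v}$ is partitioned into two subvectors denoted by $\mathbf{y}$ and $\mathbf{x}$, where $\mathbf{y}$ is $p \times 1$ and $\mathbf{x}$ is $q \times 1$, with a corresponding partitioning of $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ [see (3.32) and (3.33)]:

$$
\mathbf{v} = \begin{pmatrix} \mathbf{y} \\ \mathbf{x} \end{pmatrix}, \qquad
\boldsymbol{\mu} = E\begin{pmatrix} \mathbf{y} \\ \mathbf{x} \end{pmatrix} = \begin{pmatrix} \boldsymbol{\mu}_y \\ \boldsymbol{\mu}_x \end{pmatrix}, \qquad
\boldsymbol{\Sigma} = \operatorname{cov}\begin{pmatrix} \mathbf{y} \\ \mathbf{x} \end{pmatrix} = \begin{pmatrix} \boldsymbol{\Sigma}_{yy} & \boldsymbol{\Sigma}_{yx} \\ \boldsymbol{\Sigma}_{xy} & \boldsymbol{\Sigma}_{xx} \end{pmatrix}.
$$

By (3.15), if two random variables $y_i$ and $y_j$ are independent, then $\sigma_{ij} = 0$. The converse of this is not true, as illustrated in Example 3.2. By extension, if two random vectors $\mathbf{y}$ and $\mathbf{x}$ are independent (i.e., each $y_i$ is independent of each $x_j$), then $\boldsymbol{\Sigma}_{yx} = \mathbf{O}$ (the covariance of each $y_i$ with each $x_j$ is 0). The converse is not true in general, but it is true for multivariate normal random vectors.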


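Theorem 4.4c. If $\mathbf{v} = \binom{\mathbf{y}}{\mathbf{x}}$ is $N_{p+q}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, then $\mathbf{y}$ and $\mathbf{x}$ are independent if $\boldsymbol{\Sigma}_{yx} = \mathbf{O}$.


Proof. Suppose $\boldsymbol{\Sigma}_{yx} = \mathbf{O}$. Then

$$\boldsymbol{\Sigma} = \begin{pmatrix} \boldsymbol{\Sigma}_{yy} & \mathbf{O} \\ \mathbf{O} & \boldsymbol{\Sigma}_{xx} \end{pmatrix},$$

and the exponent of the moment generating function in (4.18) becomes

$$
\begin{aligned}
\mathbf{t}'\boldsymbol{\mu} + \tfrac{1}{2}\mathbf{t}'\boldsymbol{\Sigma}\mathbf{t}
&= (\mathbf{t}_y', \mathbf{t}_x')\begin{pmatrix} \boldsymbol{\mu}_y \\ \boldsymbol{\mu}_x \end{pmatrix}
+ \tfrac{1}{2}(\mathbf{t}_y', \mathbf{t}_x')\begin{pmatrix} \boldsymbol{\Sigma}_{yy} & \mathbf{O} \\ \mathbf{O} & \boldsymbol{\Sigma}_{xx} \end{pmatrix}\begin{pmatrix} \mathbf{t}_y \\ \mathbf{t}_x \end{pmatrix} \\
&= \mathbf{t}_y'\boldsymbol{\mu}_y + \mathbf{t}_x'\boldsymbol{\mu}_x + \tfrac{1}{2}\mathbf{t}_y'\boldsymbol{\Sigma}_{yy}\mathbf{t}_y + \tfrac{1}{2}\mathbf{t}_x'\boldsymbol{\Sigma}_{xx}\mathbf{t}_x.
\end{aligned}
$$

The moment generating function can then be written as

$$M_{\mathbf{v}}(\mathbf{t}) = e^{\mathbf{t}_y'\boldsymbol{\mu}_y + \mathbf{t}_y'\boldsymbol{\Sigma}_{yy}\mathbf{t}_y/2}\;e^{\mathbf{t}_x'\boldsymbol{\mu}_x + \mathbf{t}_x'\boldsymbol{\Sigma}_{xx}\mathbf{t}_x/2},$$

which is the product of the moment generating functions of $\mathbf{y}$ and $\mathbf{x}$. Hence, by (4.23), $\mathbf{y}$ and $\mathbf{x}$ are independent.


Corollary 1. If $\mathbf{y}$ is $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, then any two individual variables $y_i$ and $y_j$ are independent if $\sigma_{ij} = 0$.


Corollary 2. If $\mathbf{y}$ is $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ and if $\operatorname{cov}(\mathbf{A}\mathbf{y}, \mathbf{B}\mathbf{y}) = \mathbf{A}\boldsymbol{\Sigma}\mathbf{B}' = \mathbf{O}$, then $\mathbf{A}\mathbf{y}$ and $\mathbf{B}\mathbf{y}$ are independent.

The relationship between subvectors $\mathbf{y}$ and $\mathbf{x}$ when they are not independent ($\boldsymbol{\Sigma}_{yx} \ne \mathbf{O}$) is given in the following theorem.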


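Theorem 4.4d. If $\mathbf{y}$ and $\mathbf{x}$ are jointly multivariate normal with $\boldsymbol{\Sigma}_{yx} \ne \mathbf{O}$, then the conditional distribution of $\mathbf{y}$ given $\mathbf{x}$, $f(\mathbf{y}|\mathbf{x})$, is multivariate normal with mean vector and covariance matrix

\begin{align}
E(\mathbf{y}|\mathbf{x}) &= \boldsymbol{\mu}_y + \boldsymbol{\Sigma}_{yx}\boldsymbol{\Sigma}_{xx}^{-1}(\mathbf{x} - \boldsymbol{\mu}_x), \tag{4.26} \\
\operatorname{cov}(\mathbf{y}|\mathbf{x}) &= \boldsymbol{\Sigma}_{yy} - \boldsymbol{\Sigma}_{yx}\boldsymbol{\Sigma}_{xx}^{-1}\boldsymbol{\Sigma}_{xy}. \tag{4.27}
\end{align}

Proof. By an extension of (3.18), the conditional density of $\mathbf{y}$ given $\mathbf{x}$ is

$$f(\mathbf{y}|\mathbf{x}) = \frac{g(\mathbf{y}, \mathbf{x})}{h(\mathbf{x})}, \tag{4.28}$$

where $g(\mathbf{y}, \mathbf{x})$ is the joint density of $\mathbf{y}$ and $\mathbf{x}$, and $h(\mathbf{x})$ is the marginal density of $\mathbf{x}$. The proof can be carried out by directly evaluating the ratio on the right hand side of (4.28), using results (2.50) and (2.71) (see Problem 4.13). For variety, we use an alternative approach that avoids working explicitly with $g(\mathbf{y}, \mathbf{x})$ and $h(\mathbf{x})$ and the resulting partitioned matrix formulas.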


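Consider the function

$$
\begin{pmatrix} \mathbf{w} \\ \mathbf{u} \end{pmatrix}
= \mathbf{A}\left[\begin{pmatrix} \mathbf{y} \\ \mathbf{x} \end{pmatrix} - \begin{pmatrix} \boldsymbol{\mu}_y \\ \boldsymbol{\mu}_x \end{pmatrix}\right], \tag{4.29}
$$

where

$$
\mathbf{A} = \begin{pmatrix} \mathbf{A}_1 \\ \mathbf{A}_2 \end{pmatrix}
= \begin{pmatrix} \mathbf{I} & -\boldsymbol{\Sigma}_{yx}\boldsymbol{\Sigma}_{xx}^{-1} \\ \mathbf{O} & \mathbf{I} \end{pmatrix}.
$$

To be conformal, the identity matrix in $\mathbf{A}_1$ is $p \times p$ while the identity in $\mathbf{A}_2$ is $q \times q$. Simplifying and rearranging (4.29), we obtain $\mathbf{w} = \mathbf{y} - [\boldsymbol{\mu}_y + \boldsymbol{\Sigma}_{yx}\boldsymbol{\Sigma}_{xx}^{-1}(\mathbf{x} - \boldsymbol{\mu}_x)]$ and $\mathbf{u} = \mathbf{x} - \boldsymbol{\mu}_x$. Using the multivariate change-of-variable technique [referred to in (4.6)], the joint density of $(\mathbf{w}, \mathbf{u})$ is

$$p(\mathbf{w}, \mathbf{u}) = g(\mathbf{y}, \mathbf{x})\,|\mathbf{A}^{-1}| = g(\mathbf{y}, \mathbf{x})$$

[employing Theorem 2.9a (ii) and (vi)]. Similarly, the marginal density of $\mathbf{u}$ is

$$q(\mathbf{u}) = h(\mathbf{x})\,|\mathbf{I}^{-1}| = h(\mathbf{x}).$$

Using (3.45), it also turns out that

$$\operatorname{cov}(\mathbf{w}, \mathbf{u}) = \mathbf{A}_1\boldsymbol{\Sigma}\mathbf{A}_2' = \boldsymbol{\Sigma}_{yx} - \boldsymbol{\Sigma}_{yx}\boldsymbol{\Sigma}_{xx}^{-1}\boldsymbol{\Sigma}_{xx} = \mathbf{O} \tag{4.30}$$

(see Problem 4.14). Thus, by Theorem 4.4c, $\mathbf{w}$ is independent of $\mathbf{u}$. Hence

$$p(\mathbf{w}, \mathbf{u}) = r(\mathbf{w})\,q(\mathbf{u}),$$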

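where $r(\mathbf{w})$ is the density of $\mathbf{w}$. Since $p(\mathbf{w}, \mathbf{u}) = g(\mathbf{y}, \mathbf{x})$ and $q(\mathbf{u}) = h(\mathbf{x})$, we also have

$$g(\mathbf{y}, \mathbf{x}) = r(\mathbf{w})\,h(\mathbf{x}),$$

and by (4.28),

$$f(\mathbf{y}|\mathbf{x}) = r(\mathbf{w}).$$

Hence we obtain $f(\mathbf{y}|\mathbf{x})$ simply by finding $r(\mathbf{w})$. By Corollary 1 to Theorem 4.4a, $r(\mathbf{w})$ is the multivariate normal density with

\begin{align}
\boldsymbol{\mu}_w &= \mathbf{A}_1\left[\begin{pmatrix} \boldsymbol{\mu}_y \\ \boldsymbol{\mu}_x \end{pmatrix} - \begin{pmatrix} \boldsymbol{\mu}_y \\ \boldsymbol{\mu}_x \end{pmatrix}\right] = \mathbf{0}, \tag{4.31} \\
\boldsymbol{\Sigma}_{ww} &= \mathbf{A}_1\boldsymbol{\Sigma}\mathbf{A}_1'
= (\mathbf{I}, -\boldsymbol{\Sigma}_{yx}\boldsymbol{\Sigma}_{xx}^{-1})\begin{pmatrix} \boldsymbol{\Sigma}_{yy} & \boldsymbol{\Sigma}_{yx} \\ \boldsymbol{\Sigma}_{xy} & \boldsymbol{\Sigma}_{xx} \end{pmatrix}\begin{pmatrix} \mathbf{I} \\ -\boldsymbol{\Sigma}_{xx}^{-1}\boldsymbol{\Sigma}_{xy} \end{pmatrix} \notag \\
&= \boldsymbol{\Sigma}_{yy} - \boldsymbol{\Sigma}_{yx}\boldsymbol{\Sigma}_{xx}^{-1}\boldsymbol{\Sigma}_{xy}. \tag{4.32}
\end{align}

Thus $r(\mathbf{w}) = r(\mathbf{y} - [\boldsymbol{\mu}_y + \boldsymbol{\Sigma}_{yx}\boldsymbol{\Sigma}_{xx}^{-1}(\mathbf{x} - \boldsymbol{\mu}_x)])$ is of the form $N_p(\mathbf{0}, \boldsymbol{\Sigma}_{yy} - \boldsymbol{\Sigma}_{yx}\boldsymbol{\Sigma}_{xx}^{-1}\boldsymbol{\Sigma}_{xy})$. Equivalently, $\mathbf{y}|\mathbf{x}$ is $N_p[\boldsymbol{\mu}_y + \boldsymbol{\Sigma}_{yx}\boldsymbol{\Sigma}_{xx}^{-1}(\mathbf{x} - \boldsymbol{\mu}_x),\ \boldsymbol{\Sigma}_{yy} - \boldsymbol{\Sigma}_{yx}\boldsymbol{\Sigma}_{xx}^{-1}\boldsymbol{\Sigma}_{xy}]$.

Since $E(\mathbf{y}|\mathbf{x}) = \boldsymbol{\mu}_y + \boldsymbol{\Sigma}_{yx}\boldsymbol{\Sigma}_{xx}^{-1}(\mathbf{x} - \boldsymbol{\mu}_x)$ in (4.26) is a linear function of $\mathbf{x}$, any pair of variables $y_i$ and $y_j$ in a multivariate normal vector exhibits a linear trend $E(y_i|y_j) = \mu_i + (\sigma_{ij}/\sigma_{jj})(y_j - \mu_j)$. Thus the covariance $\sigma_{ij}$ is related to the slope of the line representing the trend, and $\sigma_{ij}$ is a useful measure of relationship between two normal variables. In the case of nonnormal variables that exhibit a curved trend, $\sigma_{ij}$ may give a very misleading indication of the relationship, as illustrated in Example 3.2.

The conditional covariance matrix $\operatorname{cov}(\mathbf{y}|\mathbf{x}) = \boldsymbol{\Sigma}_{yy} - \boldsymbol{\Sigma}_{yx}\boldsymbol{\Sigma}_{xx}^{-1}\boldsymbol{\Sigma}_{xy}$ in (4.27) does not involve $\mathbf{x}$. For some nonnormal distributions, on the other hand, $\operatorname{cov}(\mathbf{y}|\mathbf{x})$ is a function of $\mathbf{x}$.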

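If there is only one $y$, so that $\mathbf{v}$ is partitioned in the form $\mathbf{v} = (y, x_1, x_2, \ldots, x_q)' = (y, \mathbf{x}')'$, then $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ have the form

$$
\boldsymbol{\mu} = \begin{pmatrix} \mu_y \\ \boldsymbol{\mu}_x \end{pmatrix}, \qquad
\boldsymbol{\Sigma} = \begin{pmatrix} \sigma_y^2 & \boldsymbol{\sigma}_{yx}' \\ \boldsymbol{\sigma}_{yx} & \boldsymbol{\Sigma}_{xx} \end{pmatrix},
$$

where $\mu_y$ and $\sigma_y^2$ are the mean and variance of $y$, $\boldsymbol{\sigma}_{yx}' = (\sigma_{y1}, \sigma_{y2}, \ldots, \sigma_{yq})$ contains the covariances $\sigma_{yi} = \operatorname{cov}(y, x_i)$, and $\boldsymbol{\Sigma}_{xx}$ contains the variances and covariances of the $x$ variables. The conditional distribution is given in the following corollary to Theorem 4.4d.

Corollary 1. If $\mathbf{v} = (y, x_1, x_2, \ldots, x_q)' = (y, \mathbf{x}')'$, with

$$
\boldsymbol{\mu} = \begin{pmatrix} \mu_y \\ \boldsymbol{\mu}_x \end{pmatrix}, \qquad
\boldsymbol{\Sigma} = \begin{pmatrix} \sigma_y^2 & \boldsymbol{\sigma}_{yx}' \\ \boldsymbol{\sigma}_{yx} & \boldsymbol{\Sigma}_{xx} \end{pmatrix},
$$

then $y|\mathbf{x}$ is normal with

\begin{align}
E(y|\mathbf{x}) &= \mu_y + \boldsymbol{\sigma}_{yx}'\boldsymbol{\Sigma}_{xx}^{-1}(\mathbf{x} - \boldsymbol{\mu}_x), \tag{4.33} \\
\operatorname{var}(y|\mathbf{x}) &= \sigma_y^2 - \boldsymbol{\sigma}_{yx}'\boldsymbol{\Sigma}_{xx}^{-1}\boldsymbol{\sigma}_{yx}. \tag{4.34}
\end{align}

In (4.34), $\boldsymbol{\sigma}_{yx}'\boldsymbol{\Sigma}_{xx}^{-1}\boldsymbol{\sigma}_{yx} \ge 0$ because $\boldsymbol{\Sigma}_{xx}^{-1}$ is positive definite. Therefore

$$\operatorname{var}(y|\mathbf{x}) \le \operatorname{var}(y). \tag{4.35}$$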


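Example 4.4a. To illustrate Theorems 4.4a-c, suppose that $\mathbf{y}$ is $N_3(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, where

$$
\boldsymbol{\mu} = \begin{pmatrix} 3 \\ 1 \\ 2 \end{pmatrix}, \qquad
\boldsymbol{\Sigma} = \begin{pmatrix} 4 & 0 & 2 \\ 0 & 1 & -1 \\ 2 & -1 & 3 \end{pmatrix}.
$$

For $z = y_1 - 2y_2 + y_3 = (1, -2, 1)\mathbf{y} = \mathbf{a}'\mathbf{y}$, we have $\mathbf{a}'\boldsymbol{\mu} = 3$ and $\mathbf{a}'\boldsymbol{\Sigma}\mathbf{a} = 19$. Hence by Theorem 4.4a(i), $z$ is $N(3, 19)$.

The linear functions

$$z_1 = y_1 - y_2 + y_3, \qquad z_2 = 3y_1 + y_2 - 2y_3$$

can be written as

$$
\mathbf{z} = \begin{pmatrix} z_1 \\ z_2 \end{pmatrix}
= \begin{pmatrix} 1 & -1 & 1 \\ 3 & 1 & -2 \end{pmatrix}\begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix} = \mathbf{A}\mathbf{y}.
$$

Then by Theorem 3.6b(i) and Theorem 3.6d(i), we obtain

$$
\mathbf{A}\boldsymbol{\mu} = \begin{pmatrix} 4 \\ 6 \end{pmatrix}, \qquad
\mathbf{A}\boldsymbol{\Sigma}\mathbf{A}' = \begin{pmatrix} 14 & 4 \\ 4 & 29 \end{pmatrix},
$$

and by Theorem 4.4a(ii), we have

$$\mathbf{z} \text{ is } N_2\left[\begin{pmatrix} 4 \\ 6 \end{pmatrix}, \begin{pmatrix} 14 & 4 \\ 4 & 29 \end{pmatrix}\right].$$

To illustrate the marginal distributions in Theorem 4.4b, note that $y_1$ is $N(3, 4)$, $y_3$ is $N(2, 3)$,

$$
\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} \text{ is } N_2\left[\begin{pmatrix} 3 \\ 1 \end{pmatrix}, \begin{pmatrix} 4 & 0 \\ 0 & 1 \end{pmatrix}\right],
\qquad \text{and} \qquad
\begin{pmatrix} y_1 \\ y_3 \end{pmatrix} \text{ is } N_2\left[\begin{pmatrix} 3 \\ 2 \end{pmatrix}, \begin{pmatrix} 4 & 2 \\ 2 & 3 \end{pmatrix}\right].
$$

To illustrate Theorem 4.4c, we note that $\sigma_{12} = 0$, and therefore $y_1$ and $y_2$ are independent.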
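The arithmetic in Example 4.4a can be reproduced with a few lines of plain Python. This sketch takes the mean vector (3, 1, 2)' and the covariance matrix with rows (4, 0, 2), (0, 1, -1), (2, -1, 3), which are the values consistent with the results quoted in the example; the helper name `bilinear` is an assumption of this illustration.

```python
def bilinear(u, S, v):
    """Return u' S v for vectors u, v and a square matrix S."""
    return sum(u[i] * S[i][j] * v[j] for i in range(len(u)) for j in range(len(v)))

mu = [3, 1, 2]
Sigma = [[4, 0, 2],
         [0, 1, -1],
         [2, -1, 3]]

a = [1, -2, 1]                       # z = y1 - 2*y2 + y3
A = [[1, -1, 1],
     [3, 1, -2]]                     # rows give z1 and z2

mean_z = sum(ai * mi for ai, mi in zip(a, mu))                    # a'mu
var_z = bilinear(a, Sigma, a)                                     # a'Sigma a
mean_vec = [sum(r * m for r, m in zip(row, mu)) for row in A]     # A mu
cov_mat = [[bilinear(ri, Sigma, rj) for rj in A] for ri in A]     # A Sigma A'

print(mean_z, var_z)      # parameters of the univariate normal for z
print(mean_vec, cov_mat)  # parameters of the bivariate normal for (z1, z2)'
```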


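Example 4.4b. To illustrate Theorem 4.4d, let the random vector $\mathbf{v}$ be $N_4(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, where

$$
\boldsymbol{\mu} = \begin{pmatrix} 2 \\ 5 \\ -2 \\ 1 \end{pmatrix}, \qquad
\boldsymbol{\Sigma} = \begin{pmatrix} 9 & 0 & 3 & 3 \\ 0 & 1 & -1 & 2 \\ 3 & -1 & 6 & -3 \\ 3 & 2 & -3 & 7 \end{pmatrix}.
$$

If $\mathbf{v}$ is partitioned as $\mathbf{v} = (y_1, y_2, x_1, x_2)'$, then $\boldsymbol{\mu}_y = \binom{2}{5}$, $\boldsymbol{\mu}_x = \binom{-2}{1}$, $\boldsymbol{\Sigma}_{yy} = \begin{pmatrix} 9 & 0 \\ 0 & 1 \end{pmatrix}$, $\boldsymbol{\Sigma}_{yx} = \begin{pmatrix} 3 & 3 \\ -1 & 2 \end{pmatrix}$, and $\boldsymbol{\Sigma}_{xx} = \begin{pmatrix} 6 & -3 \\ -3 & 7 \end{pmatrix}$. By (4.26), we obtain

$$
\begin{aligned}
E(\mathbf{y}|\mathbf{x}) &= \boldsymbol{\mu}_y + \boldsymbol{\Sigma}_{yx}\boldsymbol{\Sigma}_{xx}^{-1}(\mathbf{x} - \boldsymbol{\mu}_x) \\
&= \begin{pmatrix} 2 \\ 5 \end{pmatrix} + \begin{pmatrix} 3 & 3 \\ -1 & 2 \end{pmatrix}\begin{pmatrix} 6 & -3 \\ -3 & 7 \end{pmatrix}^{-1}\begin{pmatrix} x_1 + 2 \\ x_2 - 1 \end{pmatrix} \\
&= \begin{pmatrix} 2 \\ 5 \end{pmatrix} + \frac{1}{33}\begin{pmatrix} 30 & 27 \\ -1 & 9 \end{pmatrix}\begin{pmatrix} x_1 + 2 \\ x_2 - 1 \end{pmatrix} \\
&= \begin{pmatrix} 3 + \frac{10}{11}x_1 + \frac{9}{11}x_2 \\[2pt] \frac{14}{3} - \frac{1}{33}x_1 + \frac{3}{11}x_2 \end{pmatrix}.
\end{aligned}
$$

By (4.27), we have

$$
\begin{aligned}
\operatorname{cov}(\mathbf{y}|\mathbf{x}) &= \boldsymbol{\Sigma}_{yy} - \boldsymbol{\Sigma}_{yx}\boldsymbol{\Sigma}_{xx}^{-1}\boldsymbol{\Sigma}_{xy} \\
&= \begin{pmatrix} 9 & 0 \\ 0 & 1 \end{pmatrix} - \frac{1}{33}\begin{pmatrix} 30 & 27 \\ -1 & 9 \end{pmatrix}\begin{pmatrix} 3 & -1 \\ 3 & 2 \end{pmatrix} \\
&= \begin{pmatrix} 9 & 0 \\ 0 & 1 \end{pmatrix} - \frac{1}{33}\begin{pmatrix} 171 & 24 \\ 24 & 19 \end{pmatrix} \\
&= \frac{1}{33}\begin{pmatrix} 126 & -24 \\ -24 & 14 \end{pmatrix}.
\end{aligned}
$$

Thus

$$
\mathbf{y}|\mathbf{x} \text{ is } N_2\left[\begin{pmatrix} 3 + \frac{10}{11}x_1 + \frac{9}{11}x_2 \\[2pt] \frac{14}{3} - \frac{1}{33}x_1 + \frac{3}{11}x_2 \end{pmatrix},\ \frac{1}{33}\begin{pmatrix} 126 & -24 \\ -24 & 14 \end{pmatrix}\right].
$$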

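Example 4.4c. To illustrate Corollary 1 to Theorem 4.4d, let $\mathbf{v}$ be $N_4(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, where $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ are as given in Example 4.4b. If $\mathbf{v}$ is partitioned as $\mathbf{v} = (y, x_1, x_2, x_3)'$, then $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ are partitioned as follows:

$$
\boldsymbol{\mu} = \begin{pmatrix} \mu_y \\ \boldsymbol{\mu}_x \end{pmatrix} = \begin{pmatrix} 2 \\ \hline 5 \\ -2 \\ 1 \end{pmatrix}, \qquad
\boldsymbol{\Sigma} = \begin{pmatrix} \sigma_y^2 & \boldsymbol{\sigma}_{yx}' \\ \boldsymbol{\sigma}_{yx} & \boldsymbol{\Sigma}_{xx} \end{pmatrix}
= \left(\begin{array}{c|ccc} 9 & 0 & 3 & 3 \\ \hline 0 & 1 & -1 & 2 \\ 3 & -1 & 6 & -3 \\ 3 & 2 & -3 & 7 \end{array}\right).
$$

By (4.33), we have

$$
\begin{aligned}
E(y|x_1, x_2, x_3) &= \mu_y + \boldsymbol{\sigma}_{yx}'\boldsymbol{\Sigma}_{xx}^{-1}(\mathbf{x} - \boldsymbol{\mu}_x) \\
&= 2 + (0, 3, 3)\begin{pmatrix} 1 & -1 & 2 \\ -1 & 6 & -3 \\ 2 & -3 & 7 \end{pmatrix}^{-1}\begin{pmatrix} x_1 - 5 \\ x_2 + 2 \\ x_3 - 1 \end{pmatrix} \\
&= 11 - \tfrac{12}{7}x_1 + \tfrac{6}{7}x_2 + \tfrac{9}{7}x_3.
\end{aligned}
$$

By (4.34), we obtain

$$
\begin{aligned}
\operatorname{var}(y|x_1, x_2, x_3) &= \sigma_y^2 - \boldsymbol{\sigma}_{yx}'\boldsymbol{\Sigma}_{xx}^{-1}\boldsymbol{\sigma}_{yx} \\
&= 9 - (0, 3, 3)\begin{pmatrix} 1 & -1 & 2 \\ -1 & 6 & -3 \\ 2 & -3 & 7 \end{pmatrix}^{-1}\begin{pmatrix} 0 \\ 3 \\ 3 \end{pmatrix} \\
&= 9 - \tfrac{45}{7} = \tfrac{18}{7}.
\end{aligned}
$$

Hence $y|x_1, x_2, x_3$ is $N(11 - \tfrac{12}{7}x_1 + \tfrac{6}{7}x_2 + \tfrac{9}{7}x_3,\ \tfrac{18}{7})$. Note that $\operatorname{var}(y|x_1, x_2, x_3) = \tfrac{18}{7}$ is less than $\operatorname{var}(y) = 9$, which illustrates (4.35).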
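The conditional mean and variance in Example 4.4c can be recomputed exactly by solving a 3 x 3 linear system instead of inverting the matrix. The following sketch assumes the mean vector (2, 5, -2, 1)' and the covariance matrix of Example 4.4b with y taken as the first variable; `solve` is a small Gauss-Jordan helper written for this illustration. With these inputs the slope coefficients are -12/7, 6/7, 9/7, the intercept evaluates to 11, and the conditional variance to 18/7.

```python
from fractions import Fraction

def solve(M, b):
    """Solve M x = b exactly by Gauss-Jordan elimination on Fractions."""
    n = len(M)
    aug = [[Fraction(v) for v in row] + [Fraction(b[i])] for i, row in enumerate(M)]
    for c in range(n):
        # find a row with a nonzero pivot in column c and swap it into place
        pivot = next(r for r in range(c, n) if aug[r][c] != 0)
        aug[c], aug[pivot] = aug[pivot], aug[c]
        aug[c] = [v / aug[c][c] for v in aug[c]]
        for r in range(n):
            if r != c:
                factor = aug[r][c]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n] for row in aug]

mu_y, mu_x = 2, [5, -2, 1]
var_y = 9
s_yx = [0, 3, 3]                      # covariances of y with x1, x2, x3
Sxx = [[1, -1, 2],
       [-1, 6, -3],
       [2, -3, 7]]

beta = solve(Sxx, s_yx)               # Sxx^{-1} s_yx, the slope coefficients in (4.33)
intercept = mu_y - sum(b * m for b, m in zip(beta, mu_x))
cond_var = var_y - sum(b * s for b, s in zip(beta, s_yx))   # (4.34)

print(beta, intercept, cond_var)
```

A quick consistency check on any such result is that the conditional mean evaluated at x equal to its own mean vector must return the unconditional mean of y, here 2.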


4.5 PARTIAL CORRELATION


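We now define the partial correlation of $y_i$ and $y_j$ adjusted for a subset of other $y$ variables. For convenience, we use the notation of Theorems 4.4c and 4.4d. The subset of $y$'s containing $y_i$ and $y_j$ is denoted by $\mathbf{y}$, and the other subset of $y$'s is denoted by $\mathbf{x}$.

Let $\mathbf{v}$ be $N_{p+q}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ and let $\mathbf{v}$, $\boldsymbol{\mu}$, and $\boldsymbol{\Sigma}$ be partitioned as in Theorems 4.4c and 4.4d:

$$
\mathbf{v} = \begin{pmatrix} \mathbf{y} \\ \mathbf{x} \end{pmatrix}, \qquad
\boldsymbol{\mu} = \begin{pmatrix} \boldsymbol{\mu}_y \\ \boldsymbol{\mu}_x \end{pmatrix}, \qquad
\boldsymbol{\Sigma} = \begin{pmatrix} \boldsymbol{\Sigma}_{yy} & \boldsymbol{\Sigma}_{yx} \\ \boldsymbol{\Sigma}_{xy} & \boldsymbol{\Sigma}_{xx} \end{pmatrix}.
$$

The covariance of $y_i$ and $y_j$ in the conditional distribution of $\mathbf{y}$ given $\mathbf{x}$ will be denoted by $\sigma_{ij\cdot rs\ldots q}$, where $y_i$ and $y_j$ are two of the variables in $\mathbf{y}$ and $y_r, y_s, \ldots, y_q$ are all the variables in $\mathbf{x}$. Thus $\sigma_{ij\cdot rs\ldots q}$ is the $(ij)$th element of $\operatorname{cov}(\mathbf{y}|\mathbf{x}) = \boldsymbol{\Sigma}_{yy} - \boldsymbol{\Sigma}_{yx}\boldsymbol{\Sigma}_{xx}^{-1}\boldsymbol{\Sigma}_{xy}$. For example, $\sigma_{13\cdot 567}$ represents the covariance between $y_1$ and $y_3$ in the conditional distribution of $y_1, y_2, y_3, y_4$ given $y_5$, $y_6$, and $y_7$ [in this case $\mathbf{x} = (y_5, y_6, y_7)'$]. Similarly, $\sigma_{22\cdot 567}$ represents the variance of $y_2$ in the conditional distribution of $y_1, y_2, y_3, y_4$ given $y_5, y_6, y_7$.

We now define the partial correlation coefficient $\rho_{ij\cdot rs\ldots q}$ to be the correlation between $y_i$ and $y_j$ in the conditional distribution of $\mathbf{y}$ given $\mathbf{x}$, where $\mathbf{x} = (y_r, y_s, \ldots, y_q)'$. From the usual definition of a correlation [see (3.19)], we can obtain $\rho_{ij\cdot rs\ldots q}$ from $\sigma_{ij\cdot rs\ldots q}$:

$$\rho_{ij\cdot rs\ldots q} = \frac{\sigma_{ij\cdot rs\ldots q}}{\sqrt{\sigma_{ii\cdot rs\ldots q}\,\sigma_{jj\cdot rs\ldots q}}}. \tag{4.36}$$

This is the population partial correlation. The sample partial correlation $r_{ij\cdot rs\ldots q}$ is discussed in Section 10.7, including a formulation that does not require normality. The matrix of partial correlations, $\mathbf{P}_{y\cdot x} = (\rho_{ij\cdot rs\ldots q})$, can be found by (3.30) and (4.27) as

$$\mathbf{P}_{y\cdot x} = \mathbf{D}_{y\cdot x}^{-1}\,\boldsymbol{\Sigma}_{y\cdot x}\,\mathbf{D}_{y\cdot x}^{-1}, \tag{4.37}$$

where $\boldsymbol{\Sigma}_{y\cdot x} = \operatorname{cov}(\mathbf{y}|\mathbf{x}) = \boldsymbol{\Sigma}_{yy} - \boldsymbol{\Sigma}_{yx}\boldsymbol{\Sigma}_{xx}^{-1}\boldsymbol{\Sigma}_{xy}$ and $\mathbf{D}_{y\cdot x} = [\operatorname{diag}(\boldsymbol{\Sigma}_{y\cdot x})]^{1/2}$.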

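Unless $\mathbf{y}$ and $\mathbf{x}$ are independent ($\boldsymbol{\Sigma}_{yx} = \mathbf{O}$), the partial correlation $\rho_{ij\cdot rs\ldots q}$ is different from the usual correlation $\rho_{ij} = \sigma_{ij}/\sqrt{\sigma_{ii}\sigma_{jj}}$. In fact, $\rho_{ij\cdot rs\ldots q}$ and $\rho_{ij}$ can be of opposite signs (for an illustration, see Problem 4.16 g, h). To show this, we express $\sigma_{ij\cdot rs\ldots q}$ in terms of $\sigma_{ij}$. We first write $\boldsymbol{\Sigma}_{yx}$ in terms of its rows

$$
\boldsymbol{\Sigma}_{yx} = \operatorname{cov}(\mathbf{y}, \mathbf{x}) = \begin{pmatrix}
\sigma_{y_1x_1} & \sigma_{y_1x_2} & \cdots & \sigma_{y_1x_q} \\
\sigma_{y_2x_1} & \sigma_{y_2x_2} & \cdots & \sigma_{y_2x_q} \\
\vdots & \vdots & & \vdots \\
\sigma_{y_px_1} & \sigma_{y_px_2} & \cdots & \sigma_{y_px_q}
\end{pmatrix}
= \begin{pmatrix} \boldsymbol{\sigma}_{1x}' \\ \boldsymbol{\sigma}_{2x}' \\ \vdots \\ \boldsymbol{\sigma}_{px}' \end{pmatrix}, \tag{4.38}
$$

where $\boldsymbol{\sigma}_{ix}' = (\sigma_{y_ix_1}, \sigma_{y_ix_2}, \ldots, \sigma_{y_ix_q})$. Then $\sigma_{ij\cdot rs\ldots q}$, the $(ij)$th element of $\boldsymbol{\Sigma}_{yy} - \boldsymbol{\Sigma}_{yx}\boldsymbol{\Sigma}_{xx}^{-1}\boldsymbol{\Sigma}_{xy}$, can be written as

$$\sigma_{ij\cdot rs\ldots q} = \sigma_{ij} - \boldsymbol{\sigma}_{ix}'\boldsymbol{\Sigma}_{xx}^{-1}\boldsymbol{\sigma}_{jx}. \tag{4.39}$$

Suppose that $\sigma_{ij}$ is positive. Then $\sigma_{ij\cdot rs\ldots q}$ is negative if $\boldsymbol{\sigma}_{ix}'\boldsymbol{\Sigma}_{xx}^{-1}\boldsymbol{\sigma}_{jx} > \sigma_{ij}$. Note also that since $\boldsymbol{\Sigma}_{xx}^{-1}$ is positive definite, (4.39) shows that

$$\sigma_{ii\cdot rs\ldots q} = \sigma_{ii} - \boldsymbol{\sigma}_{ix}'\boldsymbol{\Sigma}_{xx}^{-1}\boldsymbol{\sigma}_{ix} \le \sigma_{ii}.$$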

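Example 4.5. We compare $\rho_{12}$ and $\rho_{12\cdot 34}$ using $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ in Example 4.4b. From $\boldsymbol{\Sigma}$, we obtain

$$\rho_{12} = \frac{\sigma_{12}}{\sqrt{\sigma_{11}\sigma_{22}}} = \frac{0}{\sqrt{(9)(1)}} = 0.$$

From $\operatorname{cov}(\mathbf{y}|\mathbf{x}) = \frac{1}{33}\begin{pmatrix} 126 & -24 \\ -24 & 14 \end{pmatrix}$ in Example 4.4b, we obtain

$$
\rho_{12\cdot 34} = \frac{\sigma_{12\cdot 34}}{\sqrt{\sigma_{11\cdot 34}\,\sigma_{22\cdot 34}}}
= \frac{-24/33}{\sqrt{(126/33)(14/33)}} = \frac{-24}{\sqrt{(36)(49)}} = \frac{-4}{7} = -.571.
$$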
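As an illustration of (4.36) and (4.37), the sketch below rescales a conditional covariance matrix into a matrix of partial correlations. It assumes the conditional covariance matrix (1/33) times the matrix with rows (126, -24) and (-24, 14), and the unconditional values sigma_11 = 9, sigma_22 = 1, sigma_12 = 0 used in Example 4.5; the variable names are chosen here for illustration.

```python
import math

# cov(y|x), taken as (1/33) * [[126, -24], [-24, 14]]
cond_cov = [[126 / 33, -24 / 33],
            [-24 / 33, 14 / 33]]

# D = [diag(cov(y|x))]^{1/2}; dividing entry (i, j) by d_i * d_j applies (4.37)
d = [math.sqrt(cond_cov[i][i]) for i in range(2)]
partial_corr = [[cond_cov[i][j] / (d[i] * d[j]) for j in range(2)] for i in range(2)]

# Ordinary correlation of y1 and y2 from the unconditional covariance matrix
sigma_11, sigma_22, sigma_12 = 9, 1, 0
rho_12 = sigma_12 / math.sqrt(sigma_11 * sigma_22)

print(rho_12, partial_corr[0][1])
```

The ordinary correlation is zero while the partial correlation is about -0.571, so two variables that are uncorrelated marginally can be correlated once other variables are held fixed.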
PROBLEMS 


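4.1 Show that $E(z) = 0$ and $\operatorname{var}(z) = 1$ when $z$ has the standard normal density (4.1).

4.2 Obtain (4.8) from (4.7); that is, show that $|\boldsymbol{\Sigma}^{-1/2}| = |\boldsymbol{\Sigma}|^{-1/2}$.

4.3 Show that $\partial M_{\mathbf{y}}(\mathbf{0})/\partial \mathbf{t} = E(\mathbf{y})$ as in (4.16).

4.4 Show that $\partial^2 M_{\mathbf{y}}(\mathbf{0})/\partial t_r\,\partial t_s = E(y_ry_s)$ as in (4.17).

4.5 Show that the exponent in (4.19) can be expressed as in (4.20); that is, show that $\mathbf{t}'\mathbf{y} - (\mathbf{y} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{y} - \boldsymbol{\mu})/2 = \mathbf{t}'\boldsymbol{\mu} + \mathbf{t}'\boldsymbol{\Sigma}\mathbf{t}/2 - (\mathbf{y} - \boldsymbol{\mu} - \boldsymbol{\Sigma}\mathbf{t})'\boldsymbol{\Sigma}^{-1}(\mathbf{y} - \boldsymbol{\mu} - \boldsymbol{\Sigma}\mathbf{t})/2$.

4.6 Prove Corollary 1 to Theorem 4.3.

4.7 Show that $E(e^{\mathbf{t}'\mathbf{A}\mathbf{y}}) = e^{\mathbf{t}'(\mathbf{A}\boldsymbol{\mu}) + \mathbf{t}'(\mathbf{A}\boldsymbol{\Sigma}\mathbf{A}')\mathbf{t}/2}$ as in (4.25).

4.8 Consider a random variable with moment generating function $M(t)$. Show that the second derivative of $\ln[M(t)]$ evaluated at $t = 0$ is the variance of the random variable.

4.9 Assuming that $\mathbf{y}$ is $N_p(\boldsymbol{\mu}, \sigma^2\mathbf{I})$ and $\mathbf{C}$ is an orthogonal matrix, show that $\mathbf{C}\mathbf{y}$ is $N_p(\mathbf{C}\boldsymbol{\mu}, \sigma^2\mathbf{I})$.

4.10 Prove Corollary 1 to Theorem 4.4a.

4.11 Let $\mathbf{A} = (\mathbf{I}_r, \mathbf{O})$, as defined in the proof of Theorem 4.4b. Show that $\mathbf{A}\mathbf{y} = \mathbf{y}_1$, $\mathbf{A}\boldsymbol{\mu} = \boldsymbol{\mu}_1$, and $\mathbf{A}\boldsymbol{\Sigma}\mathbf{A}' = \boldsymbol{\Sigma}_{11}$.

4.12 Prove Corollary 2 to Theorem 4.4c.

4.13 Prove Theorem 4.4d by direct evaluation of (4.28).

4.14 Given $\mathbf{w} = \mathbf{y} - \mathbf{B}\mathbf{x}$, show that $\operatorname{cov}(\mathbf{w}, \mathbf{x}) = \boldsymbol{\Sigma}_{yx} - \mathbf{B}\boldsymbol{\Sigma}_{xx}$, as in (4.30).

4.15 Show that $E(\mathbf{y} - \boldsymbol{\Sigma}_{yx}\boldsymbol{\Sigma}_{xx}^{-1}\mathbf{x}) = \boldsymbol{\mu}_y - \boldsymbol{\Sigma}_{yx}\boldsymbol{\Sigma}_{xx}^{-1}\boldsymbol{\mu}_x$ as in (4.31) and that $\operatorname{cov}(\mathbf{y} - \boldsymbol{\Sigma}_{yx}\boldsymbol{\Sigma}_{xx}^{-1}\mathbf{x}) = \boldsymbol{\Sigma}_{yy} - \boldsymbol{\Sigma}_{yx}\boldsymbol{\Sigma}_{xx}^{-1}\boldsymbol{\Sigma}_{xy}$ as in (4.32).

4.16 Suppose that $\mathbf{y}$ is $N_4(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, where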


1 
2 

w= 3 ’ y= 
2 


Find the following. 


(a) The joint marginal distribution of $y_1$ and $y_3$

(b) The marginal distribution of $y_2$

(c) The distribution of $z = y_1 + 2y_2 - y_3 + 3y_4$

(d) The joint distribution of $z_1 = y_1 + y_2 - y_3 - y_4$ and $z_2 = -3y_1 + y_2 + 2y_3 - 2y_4$

(e) $f(y_1, y_2 | y_3, y_4)$

(f) $f(y_1, y_3 | y_2, y_4)$

(g) $\rho_{13}$

(h) $\rho_{13\cdot 24}$

(i) $f(y_1 | y_2, y_3, y_4)$


4.17 Let $\mathbf{y}$ be distributed as $N_3(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, where


2 410 
p=(-1}), S=[1 21 
3 (Ve Mees 


Find the following. 


(a) The distribution of z = 4y; — 6y2 + y3 

(b) The distribution of z = ( 2! *? Oe ) 
2y, + y2 — y3 

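(c) $f(y_2 | y_1, y_3)$

(d) $f(y_1, y_2 | y_3)$

(e) $\rho_{12}$ and $\rho_{12\cdot 3}$


4.18 If $\mathbf{y}$ is $N_3(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, where


which variables are independent? (See Corollary 1 to Theorem 4.4c.)


4.19 If $\mathbf{y}$ is $N_4(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, where

$$
\boldsymbol{\Sigma} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 2 & 0 & 0 \\ 0 & 0 & 3 & -4 \\ 0 & 0 & -4 & 6 \end{pmatrix},
$$

which variables are independent?


4.20 Show that $\sigma_{ij\cdot rs\ldots q} = \sigma_{ij} - \boldsymbol{\sigma}_{ix}'\boldsymbol{\Sigma}_{xx}^{-1}\boldsymbol{\sigma}_{jx}$ as in (4.39).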


5 Distribution of Quadratic 
Forms in y 


5.1 SUMS OF SQUARES 


In Chapters 3 and 4, we discussed some properties of linear functions of the random 
vector y. We now consider quadratic forms in y. We will find it useful in later chapters 
to express a sum of squares encountered in regression or analysis of variance as a 
quadratic form y'Ay, where y is a random vector and A is a symmetric matrix of con- 
stants [see (2.33)]. In this format, we will be able to show that certain sums of squares 
have chi-square distributions and are independent, thereby leading to F tests. 


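Example 5.1. We express some simple sums of squares as quadratic forms in $\mathbf{y}$. Let $y_1, y_2, \ldots, y_n$ be a random sample from a population with mean $\mu$ and variance $\sigma^2$. In the following identity, the total sum of squares $\sum_{i=1}^n y_i^2$ is partitioned into a sum of squares about the sample mean $\bar{y} = \sum_{i=1}^n y_i/n$ and a sum of squares due to the mean:

$$
\begin{aligned}
\sum_{i=1}^n y_i^2 &= \left(\sum_{i=1}^n y_i^2 - n\bar{y}^2\right) + n\bar{y}^2 \\
&= \sum_{i=1}^n (y_i - \bar{y})^2 + n\bar{y}^2.
\end{aligned} \tag{5.1}
$$

Using (2.20), we can express $\sum_{i=1}^n y_i^2$ as a quadratic form

$$\sum_{i=1}^n y_i^2 = \mathbf{y}'\mathbf{y} = \mathbf{y}'\mathbf{I}\mathbf{y},$$

where $\mathbf{y}' = (y_1, y_2, \ldots, y_n)$. Using $\mathbf{j} = (1, 1, \ldots, 1)'$ as defined in (2.6), we can write $\bar{y}$ as $\bar{y} = (1/n)\mathbf{j}'\mathbf{y}$, so that

$$
\begin{aligned}
n\bar{y}^2 &= n\left(\frac{1}{n}\mathbf{j}'\mathbf{y}\right)^2 = n\left(\frac{1}{n}\right)^2\mathbf{y}'\mathbf{j}\mathbf{j}'\mathbf{y} && [\text{by (2.18)}] \\
&= n\left(\frac{1}{n}\right)^2\mathbf{y}'\mathbf{J}\mathbf{y} && [\text{by (2.23)}] \\
&= \mathbf{y}'\left(\frac{1}{n}\mathbf{J}\right)\mathbf{y}.
\end{aligned}
$$

We can now write $\sum_{i=1}^n (y_i - \bar{y})^2$ as

$$
\begin{aligned}
\sum_{i=1}^n (y_i - \bar{y})^2 &= \sum_{i=1}^n y_i^2 - n\bar{y}^2 = \mathbf{y}'\mathbf{I}\mathbf{y} - \mathbf{y}'\left(\frac{1}{n}\mathbf{J}\right)\mathbf{y} \\
&= \mathbf{y}'\left(\mathbf{I} - \frac{1}{n}\mathbf{J}\right)\mathbf{y}.
\end{aligned} \tag{5.2}
$$

Hence (5.1) can be written in terms of quadratic forms as

$$\mathbf{y}'\mathbf{I}\mathbf{y} = \mathbf{y}'\left(\mathbf{I} - \frac{1}{n}\mathbf{J}\right)\mathbf{y} + \mathbf{y}'\left(\frac{1}{n}\mathbf{J}\right)\mathbf{y}. \tag{5.3}$$

The matrices of the three quadratic forms in (5.3) have the following properties:

1. $\mathbf{I} = \left(\mathbf{I} - \dfrac{1}{n}\mathbf{J}\right) + \dfrac{1}{n}\mathbf{J}$.

2. $\mathbf{I}$, $\mathbf{I} - \dfrac{1}{n}\mathbf{J}$, and $\dfrac{1}{n}\mathbf{J}$ are idempotent.

3. $\left(\mathbf{I} - \dfrac{1}{n}\mathbf{J}\right)\left(\dfrac{1}{n}\mathbf{J}\right) = \mathbf{O}$.

Using theorems given later in this chapter (and assuming normality of the $y_i$'s), these three properties lead to the conclusion that $\sum_{i=1}^n (y_i - \bar{y})^2/\sigma^2$ and $n\bar{y}^2/\sigma^2$ have chi-square distributions and are independent.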
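The three matrix properties and the partition in (5.3) can be checked exactly for a small sample. The sketch below uses n = 4 and an arbitrary data vector chosen for illustration; `matmul` and `quad` are helper names assumed here, and `fractions.Fraction` keeps the arithmetic exact.

```python
from fractions import Fraction

def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def quad(y, M):
    """Quadratic form y' M y."""
    return sum(y[i] * M[i][j] * y[j] for i in range(len(y)) for j in range(len(y)))

n = 4
I = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
Jn = [[Fraction(1, n)] * n for _ in range(n)]                      # (1/n) J
C = [[I[i][j] - Jn[i][j] for j in range(n)] for i in range(n)]     # I - (1/n) J

y = [3, 1, 4, 2]                          # arbitrary illustrative sample
ybar = Fraction(sum(y), n)

ss_total = quad(y, I)                     # sum of y_i^2
ss_dev = quad(y, C)                       # sum of (y_i - ybar)^2, as in (5.2)
ss_mean = quad(y, Jn)                     # n * ybar^2

zero = [[0] * n for _ in range(n)]
print(ss_total, ss_dev, ss_mean)
print(matmul(C, C) == C, matmul(Jn, Jn) == Jn, matmul(C, Jn) == zero)
```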


5.2 MEAN AND VARIANCE OF QUADRATIC FORMS 
We first consider the mean of a quadratic form y’Ay. 


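Theorem 5.2a. If $\mathbf{y}$ is a random vector with mean $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$ and if $\mathbf{A}$ is a symmetric matrix of constants, then

$$E(\mathbf{y}'\mathbf{A}\mathbf{y}) = \operatorname{tr}(\mathbf{A}\boldsymbol{\Sigma}) + \boldsymbol{\mu}'\mathbf{A}\boldsymbol{\mu}. \tag{5.4}$$

Proof. By (3.25), $\boldsymbol{\Sigma} = E(\mathbf{y}\mathbf{y}') - \boldsymbol{\mu}\boldsymbol{\mu}'$, which can be written as

$$E(\mathbf{y}\mathbf{y}') = \boldsymbol{\Sigma} + \boldsymbol{\mu}\boldsymbol{\mu}'. \tag{5.5}$$

Since $\mathbf{y}'\mathbf{A}\mathbf{y}$ is a scalar, it is equal to its trace. We thus have

$$
\begin{aligned}
E(\mathbf{y}'\mathbf{A}\mathbf{y}) &= E[\operatorname{tr}(\mathbf{y}'\mathbf{A}\mathbf{y})] \\
&= E[\operatorname{tr}(\mathbf{A}\mathbf{y}\mathbf{y}')] && [\text{by (2.87)}] \\
&= \operatorname{tr}[E(\mathbf{A}\mathbf{y}\mathbf{y}')] && [\text{by (3.5)}] \\
&= \operatorname{tr}[\mathbf{A}E(\mathbf{y}\mathbf{y}')] && [\text{by (3.40)}] \\
&= \operatorname{tr}[\mathbf{A}(\boldsymbol{\Sigma} + \boldsymbol{\mu}\boldsymbol{\mu}')] && [\text{by (5.5)}] \\
&= \operatorname{tr}[\mathbf{A}\boldsymbol{\Sigma} + \mathbf{A}\boldsymbol{\mu}\boldsymbol{\mu}'] && [\text{by (2.15)}] \\
&= \operatorname{tr}(\mathbf{A}\boldsymbol{\Sigma}) + \operatorname{tr}(\boldsymbol{\mu}'\mathbf{A}\boldsymbol{\mu}) && [\text{by (2.86)}] \\
&= \operatorname{tr}(\mathbf{A}\boldsymbol{\Sigma}) + \boldsymbol{\mu}'\mathbf{A}\boldsymbol{\mu}.
\end{aligned}
$$

Note that since $\mathbf{y}'\mathbf{A}\mathbf{y}$ is not a linear function of $\mathbf{y}$, $E(\mathbf{y}'\mathbf{A}\mathbf{y}) \ne E(\mathbf{y}')\mathbf{A}E(\mathbf{y})$.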


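Example 5.2a. To illustrate Theorem 5.2a, consider the sample variance

$$s^2 = \frac{\sum_{i=1}^n (y_i - \bar{y})^2}{n - 1}. \tag{5.6}$$

By (5.2), the numerator of (5.6) can be written as

$$\sum_{i=1}^n (y_i - \bar{y})^2 = \mathbf{y}'\left(\mathbf{I} - \frac{1}{n}\mathbf{J}\right)\mathbf{y},$$

where $\mathbf{y} = (y_1, y_2, \ldots, y_n)'$. If the $y$'s are assumed to be independently distributed with mean $\mu$ and variance $\sigma^2$, then $E(\mathbf{y}) = (\mu, \mu, \ldots, \mu)' = \mu\mathbf{j}$ and $\operatorname{cov}(\mathbf{y}) = \sigma^2\mathbf{I}$. Thus for use in (5.4) we have $\mathbf{A} = \mathbf{I} - (1/n)\mathbf{J}$, $\boldsymbol{\Sigma} = \sigma^2\mathbf{I}$, and $\boldsymbol{\mu} = \mu\mathbf{j}$; hence

$$E\left[\sum_{i=1}^n (y_i - \bar{y})^2\right] = \operatorname{tr}\left[\left(\mathbf{I} - \frac{1}{n}\mathbf{J}\right)(\sigma^2\mathbf{I})\right] + \mu\mathbf{j}'\left(\mathbf{I} - \frac{1}{n}\mathbf{J}\right)\mu\mathbf{j}$$
$$= \sigma^2\operatorname{tr}\left(\mathbf{I} - \frac{1}{n}\mathbf{J}\right) + \mu^2\left(\mathbf{j}'\mathbf{j} - \frac{1}{n}\mathbf{j}'\mathbf{J}\mathbf{j}\right) \quad [\text{by (2.23)}]$$
$$= \sigma^2\left(n - \frac{n}{n}\right) + \mu^2\left(n - \frac{n^2}{n}\right) \quad [\text{by (2.23)}]$$
$$= \sigma^2(n - 1) + 0.$$

Therefore

$$E(s^2) = \frac{E\left[\sum_{i=1}^n (y_i - \bar{y})^2\right]}{n - 1} = \frac{(n - 1)\sigma^2}{n - 1} = \sigma^2. \tag{5.7}$$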
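The calculation in Example 5.2a can be reproduced numerically from (5.4). The sketch below is illustrative only: the function name `expected_quad_form` and the values of $n$, $\mu$, and $\sigma^2$ are assumptions chosen for the demonstration.

```python
# Theorem 5.2a applied to A = I - (1/n)J, Sigma = sigma^2 I, mu = mu * j.

def expected_quad_form(a, sigma, mu):
    """E(y'Ay) = tr(A Sigma) + mu'A mu, with matrices given as lists of rows."""
    p = len(mu)
    trace_term = sum(a[i][k] * sigma[k][i] for i in range(p) for k in range(p))
    mean_term = sum(mu[i] * a[i][j] * mu[j] for i in range(p) for j in range(p))
    return trace_term + mean_term

n, mu_scalar, sigma2 = 5, 3.0, 4.0            # hypothetical values
A = [[(1.0 if i == j else 0.0) - 1.0 / n for j in range(n)] for i in range(n)]
Sigma = [[sigma2 if i == j else 0.0 for j in range(n)] for i in range(n)]
mu = [mu_scalar] * n

e_numerator = expected_quad_form(A, Sigma, mu)   # sigma^2 (n - 1) = 16
e_s2 = e_numerator / (n - 1)                     # sigma^2 = 4
print(e_numerator, e_s2)
```

The mean term vanishes because every row of $\mathbf{I} - (1/n)\mathbf{J}$ sums to zero, which is why $E(s^2)$ does not depend on $\mu$.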


Note that normality of the y’s is not assumed in Theorem 5.2a. However, normality 
is assumed in obtaining the moment generating function of y’Ay and var(y’ Ay) in the 
following theorems. 


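Theorem 5.2b. If $\mathbf{y}$ is $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, then the moment generating function of $\mathbf{y}'\mathbf{A}\mathbf{y}$ is

$$M_{\mathbf{y}'\mathbf{A}\mathbf{y}}(t) = |\mathbf{I} - 2t\mathbf{A}\boldsymbol{\Sigma}|^{-1/2} e^{-\boldsymbol{\mu}'[\mathbf{I} - (\mathbf{I} - 2t\mathbf{A}\boldsymbol{\Sigma})^{-1}]\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}/2}. \tag{5.8}$$

PROOF. By the multivariate analog of (3.3), we obtain

$$M_{\mathbf{y}'\mathbf{A}\mathbf{y}}(t) = E(e^{t\mathbf{y}'\mathbf{A}\mathbf{y}}) = k_1\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} e^{t\mathbf{y}'\mathbf{A}\mathbf{y}} e^{-(\mathbf{y} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{y} - \boldsymbol{\mu})/2}\, d\mathbf{y}$$
$$= k_1\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} e^{-[\mathbf{y}'(\mathbf{I} - 2t\mathbf{A}\boldsymbol{\Sigma})\boldsymbol{\Sigma}^{-1}\mathbf{y} - 2\boldsymbol{\mu}'\boldsymbol{\Sigma}^{-1}\mathbf{y} + \boldsymbol{\mu}'\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}]/2}\, d\mathbf{y},$$

where $k_1 = 1/[(\sqrt{2\pi})^p|\boldsymbol{\Sigma}|^{1/2}]$ and $d\mathbf{y} = dy_1\, dy_2 \cdots dy_p$. For $t$ sufficiently close to 0, $\mathbf{I} - 2t\mathbf{A}\boldsymbol{\Sigma}$ is nonsingular. Letting $\boldsymbol{\theta}' = \boldsymbol{\mu}'(\mathbf{I} - 2t\mathbf{A}\boldsymbol{\Sigma})^{-1}$ and $\mathbf{V}^{-1} = (\mathbf{I} - 2t\mathbf{A}\boldsymbol{\Sigma})\boldsymbol{\Sigma}^{-1}$, we obtain

$$M_{\mathbf{y}'\mathbf{A}\mathbf{y}}(t) = k_1 k_2\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} k_3\, e^{-(\mathbf{y} - \boldsymbol{\theta})'\mathbf{V}^{-1}(\mathbf{y} - \boldsymbol{\theta})/2}\, d\mathbf{y}$$

(Problem 5.4), where $k_2 = (\sqrt{2\pi})^p|\mathbf{V}|^{1/2} e^{-(\boldsymbol{\mu}'\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu} - \boldsymbol{\theta}'\mathbf{V}^{-1}\boldsymbol{\theta})/2}$ and $k_3 = 1/[(\sqrt{2\pi})^p|\mathbf{V}|^{1/2}]$. The multiple integral is equal to 1 since the multivariate normal density integrates to 1. Thus $M_{\mathbf{y}'\mathbf{A}\mathbf{y}}(t) = k_1 k_2$. Substituting and simplifying, we obtain (5.8) (see Problem 5.5).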


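Theorem 5.2c. If $\mathbf{y}$ is $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, then

$$\operatorname{var}(\mathbf{y}'\mathbf{A}\mathbf{y}) = 2\operatorname{tr}[(\mathbf{A}\boldsymbol{\Sigma})^2] + 4\boldsymbol{\mu}'\mathbf{A}\boldsymbol{\Sigma}\mathbf{A}\boldsymbol{\mu}. \tag{5.9}$$

PROOF. The variance of a random variable can be obtained by evaluating the second derivative of the natural logarithm of its moment generating function at $t = 0$ (see hint to Problem 5.14). Let $\mathbf{C} = \mathbf{I} - 2t\mathbf{A}\boldsymbol{\Sigma}$. Then, from (5.8)

$$k(t) = \ln[M_{\mathbf{y}'\mathbf{A}\mathbf{y}}(t)] = -\tfrac{1}{2}\ln|\mathbf{C}| - \tfrac{1}{2}\boldsymbol{\mu}'(\mathbf{I} - \mathbf{C}^{-1})\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}.$$

Using (2.117), we differentiate $k(t)$ twice to obtain

$$k''(t) = \frac{1}{2}\frac{1}{|\mathbf{C}|^2}\left(\frac{d|\mathbf{C}|}{dt}\right)^2 - \frac{1}{2}\frac{1}{|\mathbf{C}|}\frac{d^2|\mathbf{C}|}{dt^2} - \frac{1}{2}\boldsymbol{\mu}'\mathbf{C}^{-1}\frac{d^2\mathbf{C}}{dt^2}\mathbf{C}^{-1}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu} + \boldsymbol{\mu}'\left[\mathbf{C}^{-1}\frac{d\mathbf{C}}{dt}\right]^2\mathbf{C}^{-1}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}$$

(Problem 5.6). A useful expression for $|\mathbf{C}|$ can be found using (2.97) and (2.107). Thus, if the eigenvalues of $\mathbf{A}\boldsymbol{\Sigma}$ are $\lambda_i$, $i = 1, \ldots, p$, we obtain

$$|\mathbf{C}| = \prod_{i=1}^p (1 - 2t\lambda_i) = 1 - 2t\sum_i \lambda_i + 4t^2\sum_{i<j}\lambda_i\lambda_j - \cdots + (-1)^p 2^p t^p \lambda_1\lambda_2\cdots\lambda_p.$$

Then $d|\mathbf{C}|/dt = -2\sum_i\lambda_i + 8t\sum_{i<j}\lambda_i\lambda_j +$ higher-order terms in $t$, and $d^2|\mathbf{C}|/dt^2 = 8\sum_{i<j}\lambda_i\lambda_j +$ higher-order terms in $t$. Evaluating these expressions at $t = 0$, we obtain $|\mathbf{C}| = 1$, $(d|\mathbf{C}|/dt)|_{t=0} = -2\sum_i\lambda_i = -2\operatorname{tr}(\mathbf{A}\boldsymbol{\Sigma})$, and $(d^2|\mathbf{C}|/dt^2)|_{t=0} = 8\sum_{i<j}\lambda_i\lambda_j$. For $t = 0$ it is also true that $\mathbf{C} = \mathbf{I}$, $\mathbf{C}^{-1} = \mathbf{I}$, $(d\mathbf{C}/dt)|_{t=0} = -2\mathbf{A}\boldsymbol{\Sigma}$, and $(d^2\mathbf{C}/dt^2)|_{t=0} = \mathbf{O}$. Thus

$$k''(0) = 2[\operatorname{tr}(\mathbf{A}\boldsymbol{\Sigma})]^2 - 4\sum_{i<j}\lambda_i\lambda_j + 0 + 4\boldsymbol{\mu}'\mathbf{A}\boldsymbol{\Sigma}\mathbf{A}\boldsymbol{\mu}$$
$$= 2\left\{[\operatorname{tr}(\mathbf{A}\boldsymbol{\Sigma})]^2 - 2\sum_{i<j}\lambda_i\lambda_j\right\} + 4\boldsymbol{\mu}'\mathbf{A}\boldsymbol{\Sigma}\mathbf{A}\boldsymbol{\mu}.$$

By Problem 2.81, this can be written as

$$2\operatorname{tr}[(\mathbf{A}\boldsymbol{\Sigma})^2] + 4\boldsymbol{\mu}'\mathbf{A}\boldsymbol{\Sigma}\mathbf{A}\boldsymbol{\mu}.$$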
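Formula (5.9) is easy to evaluate directly. The following sketch (function names and numerical inputs are illustrative assumptions) computes it for list-of-rows matrices and compares the one-dimensional case with the familiar result $\operatorname{var}(y^2) = 2\sigma^4 + 4\mu^2\sigma^2$ for $y$ distributed as $N(\mu, \sigma^2)$.

```python
# Theorem 5.2c: var(y'Ay) = 2 tr[(A Sigma)^2] + 4 mu'A Sigma A mu (y normal).

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def trace(a):
    return sum(a[i][i] for i in range(len(a)))

def var_quad_form(a, sigma, mu):
    a_sigma = matmul(a, sigma)
    a_sigma_a = matmul(a_sigma, a)
    p = len(mu)
    mean_term = sum(mu[i] * a_sigma_a[i][j] * mu[j]
                    for i in range(p) for j in range(p))
    return 2.0 * trace(matmul(a_sigma, a_sigma)) + 4.0 * mean_term

# One-dimensional check: A = [1], Sigma = [s2], mu = [m].
m, s2 = 1.5, 2.0                                  # hypothetical values
scalar_var = var_quad_form([[1.0]], [[s2]], [m])  # 2*s2^2 + 4*m^2*s2 = 26

# Two-dimensional hypothetical example.
A = [[2.0, 1.0], [1.0, 3.0]]
Sigma = [[1.0, 0.5], [0.5, 2.0]]
mu = [1.0, -1.0]
var_2d = var_quad_form(A, Sigma, mu)              # 2(63.5) + 4(7) = 155
print(scalar_var, var_2d)
```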


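We now consider $\operatorname{cov}(\mathbf{y}, \mathbf{y}'\mathbf{A}\mathbf{y})$. To clarify the meaning of the expression $\operatorname{cov}(\mathbf{y}, \mathbf{y}'\mathbf{A}\mathbf{y})$, we denote $\mathbf{y}'\mathbf{A}\mathbf{y}$ by the scalar random variable $v$. Then $\operatorname{cov}(\mathbf{y}, v)$ is a column vector containing the covariance of each $y_i$ and $v$:

$$\operatorname{cov}(\mathbf{y}, v) = E\{[\mathbf{y} - E(\mathbf{y})][v - E(v)]\} = \begin{pmatrix} \sigma_{y_1 v} \\ \sigma_{y_2 v} \\ \vdots \\ \sigma_{y_p v} \end{pmatrix}. \tag{5.10}$$

[On the other hand, $\operatorname{cov}(v, \mathbf{y})$ would be a row vector.] An expression for $\operatorname{cov}(\mathbf{y}, \mathbf{y}'\mathbf{A}\mathbf{y})$ is given in the next theorem.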


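Theorem 5.2d. If $\mathbf{y}$ is $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, then

$$\operatorname{cov}(\mathbf{y}, \mathbf{y}'\mathbf{A}\mathbf{y}) = 2\boldsymbol{\Sigma}\mathbf{A}\boldsymbol{\mu}. \tag{5.11}$$

PROOF. By the definition in (5.10), we have

$$\operatorname{cov}(\mathbf{y}, \mathbf{y}'\mathbf{A}\mathbf{y}) = E\{[\mathbf{y} - E(\mathbf{y})][\mathbf{y}'\mathbf{A}\mathbf{y} - E(\mathbf{y}'\mathbf{A}\mathbf{y})]\}.$$

By Theorem 5.2a, this becomes

$$\operatorname{cov}(\mathbf{y}, \mathbf{y}'\mathbf{A}\mathbf{y}) = E\{(\mathbf{y} - \boldsymbol{\mu})[\mathbf{y}'\mathbf{A}\mathbf{y} - \operatorname{tr}(\mathbf{A}\boldsymbol{\Sigma}) - \boldsymbol{\mu}'\mathbf{A}\boldsymbol{\mu}]\}.$$

Rewriting $\mathbf{y}'\mathbf{A}\mathbf{y} - \boldsymbol{\mu}'\mathbf{A}\boldsymbol{\mu}$ in terms of $\mathbf{y} - \boldsymbol{\mu}$ (see Problem 5.7), we obtain

$$\operatorname{cov}(\mathbf{y}, \mathbf{y}'\mathbf{A}\mathbf{y}) = E\{(\mathbf{y} - \boldsymbol{\mu})[(\mathbf{y} - \boldsymbol{\mu})'\mathbf{A}(\mathbf{y} - \boldsymbol{\mu}) + 2(\mathbf{y} - \boldsymbol{\mu})'\mathbf{A}\boldsymbol{\mu} - \operatorname{tr}(\mathbf{A}\boldsymbol{\Sigma})]\} \tag{5.12}$$
$$= E[(\mathbf{y} - \boldsymbol{\mu})(\mathbf{y} - \boldsymbol{\mu})'\mathbf{A}(\mathbf{y} - \boldsymbol{\mu})] + 2E[(\mathbf{y} - \boldsymbol{\mu})(\mathbf{y} - \boldsymbol{\mu})'\mathbf{A}\boldsymbol{\mu}] - E[(\mathbf{y} - \boldsymbol{\mu})\operatorname{tr}(\mathbf{A}\boldsymbol{\Sigma})]$$
$$= \mathbf{0} + 2\boldsymbol{\Sigma}\mathbf{A}\boldsymbol{\mu} - \mathbf{0}.$$

The first term on the right side is $\mathbf{0}$ because all third central moments of the multivariate normal are zero. The results for the other two terms do not depend on normality (see Problem 5.7).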


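Corollary 1. Let $\mathbf{B}$ be a $k \times p$ matrix of constants. Then

$$\operatorname{cov}(\mathbf{B}\mathbf{y}, \mathbf{y}'\mathbf{A}\mathbf{y}) = 2\mathbf{B}\boldsymbol{\Sigma}\mathbf{A}\boldsymbol{\mu}. \tag{5.13}$$

For the partitioned random vector $\mathbf{v} = \begin{pmatrix}\mathbf{y}\\ \mathbf{x}\end{pmatrix}$, the bilinear form $\mathbf{x}'\mathbf{A}\mathbf{y}$ was introduced in (2.34). The expected value of $\mathbf{x}'\mathbf{A}\mathbf{y}$ is given in the following theorem.

Theorem 5.2e. Let $\mathbf{v} = \begin{pmatrix}\mathbf{y}\\ \mathbf{x}\end{pmatrix}$ be a partitioned random vector with mean vector and covariance matrix given by (3.32) and (3.33)

$$E\begin{pmatrix}\mathbf{y}\\ \mathbf{x}\end{pmatrix} = \begin{pmatrix}\boldsymbol{\mu}_y\\ \boldsymbol{\mu}_x\end{pmatrix}, \qquad \operatorname{cov}\begin{pmatrix}\mathbf{y}\\ \mathbf{x}\end{pmatrix} = \begin{pmatrix}\boldsymbol{\Sigma}_{yy} & \boldsymbol{\Sigma}_{yx}\\ \boldsymbol{\Sigma}_{xy} & \boldsymbol{\Sigma}_{xx}\end{pmatrix},$$

where $\mathbf{y}$ is $p \times 1$, $\mathbf{x}$ is $q \times 1$, and $\boldsymbol{\Sigma}_{yx}$ is $p \times q$. Let $\mathbf{A}$ be a $q \times p$ matrix of constants. Then

$$E(\mathbf{x}'\mathbf{A}\mathbf{y}) = \operatorname{tr}(\mathbf{A}\boldsymbol{\Sigma}_{yx}) + \boldsymbol{\mu}_x'\mathbf{A}\boldsymbol{\mu}_y. \tag{5.14}$$

PROOF. The proof is similar to that of Theorem 5.2a; see Problem 5.10.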


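Example 5.2b. To estimate the population covariance $\sigma_{xy} = E[(x - \mu_x)(y - \mu_y)]$ in (3.10), we use the sample covariance

$$s_{xy} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{n - 1}, \tag{5.15}$$

where $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ is a bivariate random sample from a population with means $\mu_x$ and $\mu_y$, variances $\sigma_x^2$ and $\sigma_y^2$, and covariance $\sigma_{xy}$. We can write (5.15) in the form

$$s_{xy} = \frac{\sum_{i=1}^n x_i y_i - n\bar{x}\bar{y}}{n - 1} = \frac{\mathbf{x}'[\mathbf{I} - (1/n)\mathbf{J}]\mathbf{y}}{n - 1}, \tag{5.16}$$

where $\mathbf{x} = (x_1, x_2, \ldots, x_n)'$ and $\mathbf{y} = (y_1, y_2, \ldots, y_n)'$. Since $(x_i, y_i)$ is independent of $(x_j, y_j)$ for $i \neq j$, the random vector $\mathbf{v} = \begin{pmatrix}\mathbf{y}\\ \mathbf{x}\end{pmatrix}$ has mean vector and covariance matrix

$$E\begin{pmatrix}\mathbf{y}\\ \mathbf{x}\end{pmatrix} = \begin{pmatrix}\mu_y\mathbf{j}\\ \mu_x\mathbf{j}\end{pmatrix}, \qquad \operatorname{cov}\begin{pmatrix}\mathbf{y}\\ \mathbf{x}\end{pmatrix} = \begin{pmatrix}\sigma_y^2\mathbf{I} & \sigma_{xy}\mathbf{I}\\ \sigma_{xy}\mathbf{I} & \sigma_x^2\mathbf{I}\end{pmatrix},$$

where each $\mathbf{I}$ is $n \times n$. Thus for use in (5.14), we have $\mathbf{A} = \mathbf{I} - (1/n)\mathbf{J}$, $\boldsymbol{\Sigma}_{yx} = \sigma_{xy}\mathbf{I}$, $\boldsymbol{\mu}_x = \mu_x\mathbf{j}$, and $\boldsymbol{\mu}_y = \mu_y\mathbf{j}$. Hence

$$E\left[\mathbf{x}'\left(\mathbf{I} - \frac{1}{n}\mathbf{J}\right)\mathbf{y}\right] = \operatorname{tr}\left[\left(\mathbf{I} - \frac{1}{n}\mathbf{J}\right)\sigma_{xy}\mathbf{I}\right] + \mu_x\mathbf{j}'\left(\mathbf{I} - \frac{1}{n}\mathbf{J}\right)\mu_y\mathbf{j}$$
$$= \sigma_{xy}\operatorname{tr}\left(\mathbf{I} - \frac{1}{n}\mathbf{J}\right) + \mu_x\mu_y\left(\mathbf{j}'\mathbf{j} - \frac{1}{n}\mathbf{j}'\mathbf{J}\mathbf{j}\right)$$
$$= \sigma_{xy}(n - 1) + 0.$$

Therefore

$$E(s_{xy}) = \frac{E\left[\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})\right]}{n - 1} = \frac{(n - 1)\sigma_{xy}}{n - 1} = \sigma_{xy}. \tag{5.17}$$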


5.3. NONCENTRAL CHI-SQUARE DISTRIBUTION 


Before discussing the noncentral chi-square distribution, we first review the central chi-square distribution. Let $z_1, z_2, \ldots, z_n$ be a random sample from the standard normal distribution $N(0, 1)$. Since the $z$'s are independent (by definition of random sample) and each $z_i$ is $N(0, 1)$, the random vector $\mathbf{z} = (z_1, z_2, \ldots, z_n)'$ is distributed as $N_n(\mathbf{0}, \mathbf{I})$. By definition

$$\sum_{i=1}^n z_i^2 = \mathbf{z}'\mathbf{z} \text{ is } \chi^2(n); \tag{5.18}$$

that is, the sum of squares of $n$ independent standard normal random variables is distributed as a (central) chi-square random variable with $n$ degrees of freedom.

The mean, variance, and moment generating function of a chi-square random variable are given in the following theorem.

Theorem 5.3a. If $u$ is distributed as $\chi^2(n)$, then

$$E(u) = n, \tag{5.19}$$
$$\operatorname{var}(u) = 2n, \tag{5.20}$$
$$M_u(t) = \frac{1}{(1 - 2t)^{n/2}}. \tag{5.21}$$

PROOF. Since $u$ is the quadratic form $\mathbf{z}'\mathbf{I}\mathbf{z}$, $E(u)$, $\operatorname{var}(u)$, and $M_u(t)$ can be obtained by applying Theorems 5.2a, 5.2c, and 5.2b, respectively.


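Now suppose that $y_1, y_2, \ldots, y_n$ are independently distributed as $N(\mu_i, 1)$, so that $\mathbf{y}$ is $N_n(\boldsymbol{\mu}, \mathbf{I})$, where $\boldsymbol{\mu} = (\mu_1, \mu_2, \ldots, \mu_n)'$. In this case, $\sum_{i=1}^n y_i^2 = \mathbf{y}'\mathbf{y}$ does not have a chi-square distribution, but $\sum_{i=1}^n (y_i - \mu_i)^2 = (\mathbf{y} - \boldsymbol{\mu})'(\mathbf{y} - \boldsymbol{\mu})$ is $\chi^2(n)$ since $y_i - \mu_i$ is distributed as $N(0, 1)$.

The density of $v = \sum_{i=1}^n y_i^2 = \mathbf{y}'\mathbf{y}$, where the $y$'s are independently distributed as $N(\mu_i, 1)$, is called the noncentral chi-square distribution and is denoted by $\chi^2(n, \lambda)$. The noncentrality parameter $\lambda$ is defined as

$$\lambda = \frac{1}{2}\sum_{i=1}^n \mu_i^2 = \frac{1}{2}\boldsymbol{\mu}'\boldsymbol{\mu}. \tag{5.22}$$

Note that $\lambda$ is not an eigenvalue here and that the mean of $v = \sum_{i=1}^n y_i^2$ is greater than the mean of $u = \sum_{i=1}^n (y_i - \mu_i)^2$:

$$E\left[\sum_{i=1}^n (y_i - \mu_i)^2\right] = \sum_{i=1}^n E(y_i - \mu_i)^2 = \sum_{i=1}^n \operatorname{var}(y_i) = \sum_{i=1}^n 1 = n,$$

$$E\left(\sum_{i=1}^n y_i^2\right) = \sum_{i=1}^n E(y_i^2) = \sum_{i=1}^n (\sigma_i^2 + \mu_i^2) = \sum_{i=1}^n (1 + \mu_i^2) = n + \sum_{i=1}^n \mu_i^2 = n + 2\lambda,$$

where $\lambda$ is as defined in (5.22). The densities of $u$ and $v$ are illustrated in Figure 5.1.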


Figure 5.1. Central and noncentral chi-square densities [curves labeled $\chi^2(n)$ and $\chi^2(n, \lambda)$].


The mean, variance, and moment generating function of a noncentral chi-square 
random variable are given in the following theorem. 


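Theorem 5.3b. If $v$ is distributed as $\chi^2(n, \lambda)$, then

$$E(v) = n + 2\lambda, \tag{5.23}$$
$$\operatorname{var}(v) = 2n + 8\lambda, \tag{5.24}$$
$$M_v(t) = \frac{1}{(1 - 2t)^{n/2}}\, e^{-\lambda[1 - 1/(1 - 2t)]}. \tag{5.25}$$

PROOF. For $E(v)$ and $\operatorname{var}(v)$, see Problems 5.13 and 5.14. For $M_v(t)$, use Theorem 5.2b.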
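The link between (5.25) and the moments in (5.23) and (5.24) can be illustrated numerically: the first two derivatives of $\ln M_v(t)$ at $t = 0$ give the mean and variance. The sketch below approximates these derivatives by central finite differences; the values of $n$ and $\lambda$, the step size, and the function name `log_mgf` are assumptions made for the illustration.

```python
import math

def log_mgf(t, n, lam):
    """ln M_v(t) for chi-square(n, lam) in the form of (5.25); requires t < 1/2."""
    return -0.5 * n * math.log(1.0 - 2.0 * t) - lam * (1.0 - 1.0 / (1.0 - 2.0 * t))

n, lam = 4, 1.5          # hypothetical degrees of freedom and noncentrality
h = 1e-4                 # finite-difference step

k_plus = log_mgf(h, n, lam)
k_zero = log_mgf(0.0, n, lam)
k_minus = log_mgf(-h, n, lam)

mean_fd = (k_plus - k_minus) / (2.0 * h)              # approximates n + 2*lam
var_fd = (k_plus - 2.0 * k_zero + k_minus) / h ** 2   # approximates 2n + 8*lam
print(mean_fd, var_fd)
```

With these inputs the approximations should be close to $n + 2\lambda = 7$ and $2n + 8\lambda = 20$.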


Corollary 1. If $\lambda = 0$ (which corresponds to $\mu_i = 0$ for all $i$), then $E(v)$, $\operatorname{var}(v)$, and $M_v(t)$ in Theorem 5.3b reduce to $E(u)$, $\operatorname{var}(u)$, and $M_u(t)$ for the central chi-square distribution in Theorem 5.3a. Thus

$$\chi^2(n, 0) = \chi^2(n). \tag{5.26}$$


The chi-square distribution has an additive property, as shown in the following 
theorem. 


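Theorem 5.3c. If $v_1, v_2, \ldots, v_k$ are independently distributed as $\chi^2(n_i, \lambda_i)$, then

$$\sum_{i=1}^k v_i \text{ is distributed as } \chi^2\left(\sum_{i=1}^k n_i, \sum_{i=1}^k \lambda_i\right). \tag{5.27}$$

Corollary 1. If $u_1, u_2, \ldots, u_k$ are independently distributed as $\chi^2(n_i)$, then

$$\sum_{i=1}^k u_i \text{ is distributed as } \chi^2\left(\sum_{i=1}^k n_i\right).$$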


5.4 NONCENTRAL F AND t DISTRIBUTIONS


5.4.1 Noncentral F Distribution


Before defining the noncentral $F$ distribution, we first review the central $F$. If $u$ is $\chi^2(p)$, $v$ is $\chi^2(q)$, and $u$ and $v$ are independent, then by definition

$$w = \frac{u/p}{v/q} \text{ is distributed as } F(p, q), \tag{5.28}$$

the (central) $F$ distribution with $p$ and $q$ degrees of freedom. The mean and variance of $w$ are given by

$$E(w) = \frac{q}{q - 2}, \qquad \operatorname{var}(w) = \frac{2q^2(p + q - 2)}{p(q - 2)^2(q - 4)}. \tag{5.29}$$

Now suppose that $u$ is distributed as a noncentral chi-square random variable, $\chi^2(p, \lambda)$, while $v$ remains a central chi-square random variable, $\chi^2(q)$, with $u$ and $v$ independent. Then

$$z = \frac{u/p}{v/q} \text{ is distributed as } F(p, q, \lambda), \tag{5.30}$$

the noncentral $F$ distribution with noncentrality parameter $\lambda$, where $\lambda$ is the same noncentrality parameter as in the distribution of $u$ (noncentral chi-square distribution). The mean of $z$ is

$$E(z) = \frac{q}{q - 2}\left(1 + \frac{2\lambda}{p}\right), \tag{5.31}$$

which is, of course, greater than $E(w)$ in (5.29).

When an $F$ statistic is used to test a hypothesis $H_0$, the distribution will typically be central if the (null) hypothesis is true and noncentral if the hypothesis is false. Thus the noncentral $F$ distribution can often be used to evaluate the power of an $F$ test. The power of a test is the probability of rejecting $H_0$ for a given value of $\lambda$. If $F_\alpha$ is the upper $\alpha$ percentage point of the central $F$ distribution, then the power, $P(p, q, \alpha, \lambda)$, can be defined as

$$P(p, q, \alpha, \lambda) = \operatorname{Prob}(z \geq F_\alpha), \tag{5.32}$$

where $z$ is the noncentral $F$ random variable defined in (5.30). Ghosh (1973) showed that $P(p, q, \alpha, \lambda)$ increases if $q$ or $\alpha$ or $\lambda$ increases, and $P(p, q, \alpha, \lambda)$ decreases if $p$ increases. The power is illustrated in Figure 5.2.

The power as defined in (5.32) can be evaluated from tables (Tiku 1967) or directly from distribution functions available in many software packages. For example, in SAS, the noncentral $F$-distribution function PROBF can be used to find the power in (5.32) as follows:

$$P(p, q, \alpha, \lambda) = 1 - \mathrm{PROBF}(F_\alpha, p, q, \lambda).$$

A probability calculator for the $F$ and other distributions is available free of charge from NCSS (download at www.ncss.com).
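When no noncentral $F$ distribution function is at hand, the power in (5.32) can be approximated by simulation directly from the definition in (5.30). The following is a rough Monte Carlo sketch, not a substitute for an exact distribution function: the function names, the number of draws, the seed, and the choice $p = 1$, $q = 10$, $\mu_1 = 2$ (so that $\lambda = 2$) are assumptions, and the critical value 4.9646 is the upper 5% point of $F(1, 10)$ taken from standard tables.

```python
import random

def noncentral_f_draw(rng, mu, q):
    """One draw of z = (u/p)/(v/q), with u = sum (z_i + mu_i)^2 and v central chi-square(q)."""
    p = len(mu)
    u = sum((rng.gauss(0.0, 1.0) + m) ** 2 for m in mu)
    v = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(q))
    return (u / p) / (v / q)

def estimate_power(mu, q, f_alpha, draws=20000, seed=1):
    """Monte Carlo estimate of Prob(z >= F_alpha) as in (5.32)."""
    rng = random.Random(seed)
    hits = sum(noncentral_f_draw(rng, mu, q) >= f_alpha for _ in range(draws))
    return hits / draws

F_ALPHA = 4.9646                              # upper 5% point of F(1, 10), from tables
size = estimate_power([0.0], 10, F_ALPHA)     # lambda = 0: should be near alpha = .05
power = estimate_power([2.0], 10, F_ALPHA)    # lambda = (1/2)(2^2) = 2
print(size, power)
```

When translating $\lambda$ to a software package, it is worth checking how that package scales its noncentrality argument, since conventions differ.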


Figure 5.2. Central F, noncentral F, and power of the F test (shaded area) [curves labeled $F(p, q)$ and $F(p, q, \lambda)$; critical value $F_\alpha$ marked on the horizontal axis].


5.4.2 Noncentral t Distribution


We first review the central $t$ distribution. If $z$ is $N(0, 1)$, $u$ is $\chi^2(p)$, and $z$ and $u$ are independent, then by definition

$$t = \frac{z}{\sqrt{u/p}} \text{ is distributed as } t(p), \tag{5.33}$$

the (central) $t$ distribution with $p$ degrees of freedom.

Now suppose that $y$ is $N(\mu, 1)$, $u$ is $\chi^2(p)$, and $y$ and $u$ are independent. Then

$$t = \frac{y}{\sqrt{u/p}} \text{ is distributed as } t(p, \mu), \tag{5.34}$$

the noncentral $t$ distribution with $p$ degrees of freedom and noncentrality parameter $\mu$. If $y$ is $N(\mu, \sigma^2)$, then

$$t = \frac{y/\sigma}{\sqrt{u/p}} \text{ is distributed as } t(p, \mu/\sigma),$$

since by (3.4), (3.9), and Theorem 4.4a(i), $y/\sigma$ is distributed as $N(\mu/\sigma, 1)$.


5.5 DISTRIBUTION OF QUADRATIC FORMS 


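It was noted following Theorem 5.3a that if $\mathbf{y}$ is $N_n(\boldsymbol{\mu}, \mathbf{I})$, then $(\mathbf{y} - \boldsymbol{\mu})'(\mathbf{y} - \boldsymbol{\mu})$ is $\chi^2(n)$. If $\mathbf{y}$ is $N_n(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, we can extend this to

$$(\mathbf{y} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{y} - \boldsymbol{\mu}) \text{ is } \chi^2(n). \tag{5.35}$$

To show this, we write $(\mathbf{y} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{y} - \boldsymbol{\mu})$ in the form

$$(\mathbf{y} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{y} - \boldsymbol{\mu}) = (\mathbf{y} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1/2}\boldsymbol{\Sigma}^{-1/2}(\mathbf{y} - \boldsymbol{\mu})$$
$$= [\boldsymbol{\Sigma}^{-1/2}(\mathbf{y} - \boldsymbol{\mu})]'[\boldsymbol{\Sigma}^{-1/2}(\mathbf{y} - \boldsymbol{\mu})]$$
$$= \mathbf{z}'\mathbf{z},$$

where $\mathbf{z} = \boldsymbol{\Sigma}^{-1/2}(\mathbf{y} - \boldsymbol{\mu})$ and $\boldsymbol{\Sigma}^{-1/2} = (\boldsymbol{\Sigma}^{1/2})^{-1}$, with $\boldsymbol{\Sigma}^{1/2}$ given by (2.109). The vector $\mathbf{z}$ is distributed as $N_n(\mathbf{0}, \mathbf{I})$ (see Problem 5.17); therefore, $\mathbf{z}'\mathbf{z}$ is $\chi^2(n)$ by definition [see (5.18)]. Note the analogy of $(\mathbf{y} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{y} - \boldsymbol{\mu})$ to the univariate random variable $(y - \mu)^2/\sigma^2$, which is distributed as $\chi^2(1)$ if $y$ is $N(\mu, \sigma^2)$.

In the following theorem, we consider the distribution of quadratic forms in general. In the proof we follow Searle (1971, p. 57). For alternative proofs, see Graybill (1976, pp. 134-136) and Hocking (1996, p. 51).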


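Theorem 5.5. Let $\mathbf{y}$ be distributed as $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, let $\mathbf{A}$ be a symmetric matrix of constants of rank $r$, and let $\lambda = \frac{1}{2}\boldsymbol{\mu}'\mathbf{A}\boldsymbol{\mu}$. Then $\mathbf{y}'\mathbf{A}\mathbf{y}$ is $\chi^2(r, \lambda)$, if and only if $\mathbf{A}\boldsymbol{\Sigma}$ is idempotent.

PROOF. By Theorem 5.2b the moment generating function of $\mathbf{y}'\mathbf{A}\mathbf{y}$ is

$$M_{\mathbf{y}'\mathbf{A}\mathbf{y}}(t) = |\mathbf{I} - 2t\mathbf{A}\boldsymbol{\Sigma}|^{-1/2} e^{-(1/2)\boldsymbol{\mu}'[\mathbf{I} - (\mathbf{I} - 2t\mathbf{A}\boldsymbol{\Sigma})^{-1}]\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}}.$$

By (2.98), the eigenvalues of $\mathbf{I} - 2t\mathbf{A}\boldsymbol{\Sigma}$ are $1 - 2t\lambda_i$, $i = 1, 2, \ldots, p$, where $\lambda_i$ is an eigenvalue of $\mathbf{A}\boldsymbol{\Sigma}$. By (2.107), $|\mathbf{I} - 2t\mathbf{A}\boldsymbol{\Sigma}| = \prod_{i=1}^p (1 - 2t\lambda_i)$. By (2.102), $(\mathbf{I} - 2t\mathbf{A}\boldsymbol{\Sigma})^{-1} = \mathbf{I} + \sum_{k=1}^\infty (2t)^k(\mathbf{A}\boldsymbol{\Sigma})^k$, provided $-1 < 2t\lambda_i < 1$ for all $i$. Thus $M_{\mathbf{y}'\mathbf{A}\mathbf{y}}(t)$ can be written as

$$M_{\mathbf{y}'\mathbf{A}\mathbf{y}}(t) = \left(\prod_{i=1}^p (1 - 2t\lambda_i)^{-1/2}\right) e^{-(1/2)\boldsymbol{\mu}'\left[-\sum_{k=1}^\infty (2t)^k(\mathbf{A}\boldsymbol{\Sigma})^k\right]\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}}.$$

Suppose that $\mathbf{A}\boldsymbol{\Sigma}$ is idempotent of rank $r$ (the rank of $\mathbf{A}$); then $r$ of the $\lambda_i$'s are equal to 1, $p - r$ of the $\lambda_i$'s are equal to 0, and $(\mathbf{A}\boldsymbol{\Sigma})^k = \mathbf{A}\boldsymbol{\Sigma}$. Therefore,

$$M_{\mathbf{y}'\mathbf{A}\mathbf{y}}(t) = \left(\prod_{i=1}^r (1 - 2t)^{-1/2}\right) e^{-(1/2)\boldsymbol{\mu}'\left[-\sum_{k=1}^\infty (2t)^k\right]\mathbf{A}\boldsymbol{\Sigma}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}}$$
$$= (1 - 2t)^{-r/2} e^{-(1/2)\boldsymbol{\mu}'[1 - (1 - 2t)^{-1}]\mathbf{A}\boldsymbol{\mu}},$$

provided $-1 < 2t < 1$ or $-\frac{1}{2} < t < \frac{1}{2}$, which is compatible with the requirement that the moment generating function exists for $t$ in a neighborhood of 0. Thus

$$M_{\mathbf{y}'\mathbf{A}\mathbf{y}}(t) = \frac{1}{(1 - 2t)^{r/2}}\, e^{-(1/2)\boldsymbol{\mu}'\mathbf{A}\boldsymbol{\mu}[1 - 1/(1 - 2t)]},$$

which by (5.25) is the moment generating function of a noncentral chi-square random variable with degrees of freedom $r = \operatorname{rank}(\mathbf{A})$ and noncentrality parameter $\lambda = \frac{1}{2}\boldsymbol{\mu}'\mathbf{A}\boldsymbol{\mu}$.

For a proof of the converse, namely, if $\mathbf{y}'\mathbf{A}\mathbf{y}$ is $\chi^2(r, \lambda)$, then $\mathbf{A}\boldsymbol{\Sigma}$ is idempotent; see Driscoll (1999).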


Some corollaries of interest are the following (for additional corollaries, see 
Problem 5.20). 


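Corollary 1. If $\mathbf{y}$ is $N_p(\mathbf{0}, \mathbf{I})$, then $\mathbf{y}'\mathbf{A}\mathbf{y}$ is $\chi^2(r)$ if and only if $\mathbf{A}$ is idempotent of rank $r$.

Corollary 2. If $\mathbf{y}$ is $N_p(\boldsymbol{\mu}, \sigma^2\mathbf{I})$, then $\mathbf{y}'\mathbf{A}\mathbf{y}/\sigma^2$ is $\chi^2(r, \boldsymbol{\mu}'\mathbf{A}\boldsymbol{\mu}/2\sigma^2)$ if and only if $\mathbf{A}$ is idempotent of rank $r$.

Example 5.5. To illustrate Corollary 2 to Theorem 5.5, consider the distribution of $(n - 1)s^2/\sigma^2 = \sum_{i=1}^n (y_i - \bar{y})^2/\sigma^2$, where $\mathbf{y} = (y_1, y_2, \ldots, y_n)'$ is distributed as $N_n(\mu\mathbf{j}, \sigma^2\mathbf{I})$ as in Examples 5.1 and 5.2a. In (5.2) we have $\sum_{i=1}^n (y_i - \bar{y})^2 = \mathbf{y}'[\mathbf{I} - (1/n)\mathbf{J}]\mathbf{y}$. The matrix $\mathbf{I} - (1/n)\mathbf{J}$ is shown to be idempotent in Problem 5.2. Then by Theorem 2.13d, $\operatorname{rank}[\mathbf{I} - (1/n)\mathbf{J}] = \operatorname{tr}[\mathbf{I} - (1/n)\mathbf{J}] = n - 1$. We next find $\lambda$, which is given by

$$\lambda = \frac{\boldsymbol{\mu}'\mathbf{A}\boldsymbol{\mu}}{2\sigma^2} = \frac{\mu\mathbf{j}'[\mathbf{I} - (1/n)\mathbf{J}]\mu\mathbf{j}}{2\sigma^2} = \frac{\mu^2[\mathbf{j}'\mathbf{j} - (1/n)\mathbf{j}'\mathbf{J}\mathbf{j}]}{2\sigma^2}$$
$$= \frac{\mu^2[n - (1/n)\mathbf{j}'\mathbf{j}\mathbf{j}'\mathbf{j}]}{2\sigma^2} = \frac{\mu^2[n - (1/n)(n)(n)]}{2\sigma^2} = 0.$$

Therefore, $\mathbf{y}'[\mathbf{I} - (1/n)\mathbf{J}]\mathbf{y}/\sigma^2$ is $\chi^2(n - 1)$.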




5.6 INDEPENDENCE OF LINEAR FORMS AND 
QUADRATIC FORMS 


In this section, we discuss the independence of (1) a linear form and a quadratic form, 
(2) two quadratic forms, and (3) several quadratic forms. 

For an example of (1), consider $\bar{y}$ and $s^2$ in a simple random sample or $\hat{\boldsymbol{\beta}}$ and $s^2$ in a regression setting. To illustrate (2), consider the sum of squares due to regression and the sum of squares due to error. An example of (3) is given by the sums of squares due to main effects and interaction in a balanced two-way analysis of variance.

We begin with the independence of a linear form and a quadratic form. 


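Theorem 5.6a. Suppose that $\mathbf{B}$ is a $k \times p$ matrix of constants, $\mathbf{A}$ is a $p \times p$ symmetric matrix of constants, and $\mathbf{y}$ is distributed as $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. Then $\mathbf{B}\mathbf{y}$ and $\mathbf{y}'\mathbf{A}\mathbf{y}$ are independent if and only if $\mathbf{B}\boldsymbol{\Sigma}\mathbf{A} = \mathbf{O}$.

PROOF. Suppose $\mathbf{B}\boldsymbol{\Sigma}\mathbf{A} = \mathbf{O}$. We prove that $\mathbf{B}\mathbf{y}$ and $\mathbf{y}'\mathbf{A}\mathbf{y}$ are independent for the special case in which $\mathbf{A}$ is symmetric and idempotent. For a general proof, see Searle (1971, p. 59).

Assuming that $\mathbf{A}$ is symmetric and idempotent, $\mathbf{y}'\mathbf{A}\mathbf{y}$ can be written as

$$\mathbf{y}'\mathbf{A}\mathbf{y} = \mathbf{y}'\mathbf{A}'\mathbf{A}\mathbf{y} = (\mathbf{A}\mathbf{y})'\mathbf{A}\mathbf{y}.$$

If $\mathbf{B}\boldsymbol{\Sigma}\mathbf{A} = \mathbf{O}$, we have by (3.45)

$$\mathbf{B}\boldsymbol{\Sigma}\mathbf{A} = \operatorname{cov}(\mathbf{B}\mathbf{y}, \mathbf{A}\mathbf{y}) = \mathbf{O}.$$

Hence, by Corollary 2 to Theorem 4.4c, $\mathbf{B}\mathbf{y}$ and $\mathbf{A}\mathbf{y}$ are independent, and therefore $\mathbf{B}\mathbf{y}$ and the function $(\mathbf{A}\mathbf{y})'\mathbf{A}\mathbf{y}$ are also independent (Seber 1977, pp. 17, 33-34).

We now establish the converse, namely, if $\mathbf{B}\mathbf{y}$ and $\mathbf{y}'\mathbf{A}\mathbf{y}$ are independent, then $\mathbf{B}\boldsymbol{\Sigma}\mathbf{A} = \mathbf{O}$. By Corollary 1 to Theorem 5.2d, $\operatorname{cov}(\mathbf{B}\mathbf{y}, \mathbf{y}'\mathbf{A}\mathbf{y}) = \mathbf{0}$ becomes

$$2\mathbf{B}\boldsymbol{\Sigma}\mathbf{A}\boldsymbol{\mu} = \mathbf{0}.$$

Since this holds for all possible $\boldsymbol{\mu}$, we have $\mathbf{B}\boldsymbol{\Sigma}\mathbf{A} = \mathbf{O}$ [see (2.44)].

Note that $\mathbf{B}\boldsymbol{\Sigma}\mathbf{A} = \mathbf{O}$ does not imply $\mathbf{A}\boldsymbol{\Sigma}\mathbf{B} = \mathbf{O}$. In fact, the product $\mathbf{A}\boldsymbol{\Sigma}\mathbf{B}$ will not be defined unless $\mathbf{B}$ has $p$ rows.

Corollary 1. If $\mathbf{y}$ is $N_p(\boldsymbol{\mu}, \sigma^2\mathbf{I})$, then $\mathbf{B}\mathbf{y}$ and $\mathbf{y}'\mathbf{A}\mathbf{y}$ are independent if and only if $\mathbf{B}\mathbf{A} = \mathbf{O}$.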


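Example 5.6a. To illustrate Corollary 1, consider $s^2 = \sum_{i=1}^n (y_i - \bar{y})^2/(n - 1)$ and $\bar{y} = \sum_{i=1}^n y_i/n$, where $\mathbf{y} = (y_1, y_2, \ldots, y_n)'$ is $N_n(\mu\mathbf{j}, \sigma^2\mathbf{I})$. As in Example 5.1, $\bar{y}$ and $s^2$ can be written as $\bar{y} = (1/n)\mathbf{j}'\mathbf{y}$ and $s^2 = \mathbf{y}'[\mathbf{I} - (1/n)\mathbf{J}]\mathbf{y}/(n - 1)$. By Corollary 1, $\bar{y}$ is independent of $s^2$ since $(1/n)\mathbf{j}'[\mathbf{I} - (1/n)\mathbf{J}] = \mathbf{0}'$.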
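The condition $\mathbf{B}\mathbf{A} = \mathbf{O}$ in Corollary 1 can be checked numerically for Example 5.6a. The sketch below (helper name and the choice $n = 4$ are illustrative assumptions) confirms that the row vector for $\bar{y}$ annihilates $\mathbf{I} - (1/n)\mathbf{J}$, and contrasts it with the linear form $y_1$, for which the product is not zero, so that by the "only if" part of the corollary $y_1$ and $s^2$ are not independent.

```python
def row_times_matrix(b, a):
    """Row vector b' times matrix A (list of rows)."""
    return [sum(b[k] * a[k][j] for k in range(len(b))) for j in range(len(a[0]))]

n = 4
A = [[(1.0 if i == j else 0.0) - 1.0 / n for j in range(n)] for i in range(n)]

b_mean = [1.0 / n] * n                 # ybar = b_mean'y
b_first = [1.0] + [0.0] * (n - 1)      # y_1 = b_first'y

mean_product = row_times_matrix(b_mean, A)     # zero row vector
first_product = row_times_matrix(b_first, A)   # first row of A, not zero
print(mean_product)
print(first_product)
```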


We now consider the independence of two quadratic forms. 


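Theorem 5.6b. Let $\mathbf{A}$ and $\mathbf{B}$ be symmetric matrices of constants. If $\mathbf{y}$ is $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, then $\mathbf{y}'\mathbf{A}\mathbf{y}$ and $\mathbf{y}'\mathbf{B}\mathbf{y}$ are independent if and only if $\mathbf{A}\boldsymbol{\Sigma}\mathbf{B} = \mathbf{O}$.

PROOF. Suppose $\mathbf{A}\boldsymbol{\Sigma}\mathbf{B} = \mathbf{O}$. We prove that $\mathbf{y}'\mathbf{A}\mathbf{y}$ and $\mathbf{y}'\mathbf{B}\mathbf{y}$ are independent for the special case in which $\mathbf{A}$ and $\mathbf{B}$ are symmetric and idempotent. For a general proof, see Searle (1971, pp. 59-60) or Hocking (1996, p. 52).

Assuming that $\mathbf{A}$ and $\mathbf{B}$ are symmetric and idempotent, $\mathbf{y}'\mathbf{A}\mathbf{y}$ and $\mathbf{y}'\mathbf{B}\mathbf{y}$ can be written as $\mathbf{y}'\mathbf{A}\mathbf{y} = \mathbf{y}'\mathbf{A}'\mathbf{A}\mathbf{y} = (\mathbf{A}\mathbf{y})'\mathbf{A}\mathbf{y}$ and $\mathbf{y}'\mathbf{B}\mathbf{y} = \mathbf{y}'\mathbf{B}'\mathbf{B}\mathbf{y} = (\mathbf{B}\mathbf{y})'\mathbf{B}\mathbf{y}$. If $\mathbf{A}\boldsymbol{\Sigma}\mathbf{B} = \mathbf{O}$, we have [see (3.45)]

$$\mathbf{A}\boldsymbol{\Sigma}\mathbf{B} = \operatorname{cov}(\mathbf{A}\mathbf{y}, \mathbf{B}\mathbf{y}) = \mathbf{O}.$$

Hence, by Corollary 2 to Theorem 4.4c, $\mathbf{A}\mathbf{y}$ and $\mathbf{B}\mathbf{y}$ are independent. It follows that the functions $(\mathbf{A}\mathbf{y})'(\mathbf{A}\mathbf{y}) = \mathbf{y}'\mathbf{A}\mathbf{y}$ and $(\mathbf{B}\mathbf{y})'(\mathbf{B}\mathbf{y}) = \mathbf{y}'\mathbf{B}\mathbf{y}$ are independent (Seber 1977, pp. 17, 33-34).

Note that $\mathbf{A}\boldsymbol{\Sigma}\mathbf{B} = \mathbf{O}$ is equivalent to $\mathbf{B}\boldsymbol{\Sigma}\mathbf{A} = \mathbf{O}$ since transposing both sides of $\mathbf{A}\boldsymbol{\Sigma}\mathbf{B} = \mathbf{O}$ gives $\mathbf{B}\boldsymbol{\Sigma}\mathbf{A} = \mathbf{O}$ ($\mathbf{A}$ and $\mathbf{B}$ are symmetric).

Corollary 1. If $\mathbf{y}$ is $N_p(\boldsymbol{\mu}, \sigma^2\mathbf{I})$, then $\mathbf{y}'\mathbf{A}\mathbf{y}$ and $\mathbf{y}'\mathbf{B}\mathbf{y}$ are independent if and only if $\mathbf{A}\mathbf{B} = \mathbf{O}$ (or, equivalently, $\mathbf{B}\mathbf{A} = \mathbf{O}$).

Example 5.6b. To illustrate Corollary 1, consider the partitioning in (5.1), $\sum_{i=1}^n y_i^2 = \sum_{i=1}^n (y_i - \bar{y})^2 + n\bar{y}^2$, which was expressed in (5.3) as

$$\mathbf{y}'\mathbf{y} = \mathbf{y}'[\mathbf{I} - (1/n)\mathbf{J}]\mathbf{y} + \mathbf{y}'[(1/n)\mathbf{J}]\mathbf{y}.$$

If $\mathbf{y}$ is $N_n(\mu\mathbf{j}, \sigma^2\mathbf{I})$, then by Corollary 1, $\mathbf{y}'[\mathbf{I} - (1/n)\mathbf{J}]\mathbf{y}$ and $\mathbf{y}'[(1/n)\mathbf{J}]\mathbf{y}$ are independent if and only if $[\mathbf{I} - (1/n)\mathbf{J}][(1/n)\mathbf{J}] = \mathbf{O}$, which is shown in Problem 5.2.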


The distribution and independence of several quadratic forms are considered in the 
following theorem. 


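Theorem 5.6c. Let $\mathbf{y}$ be $N_n(\boldsymbol{\mu}, \sigma^2\mathbf{I})$, let $\mathbf{A}_i$ be symmetric of rank $r_i$ for $i = 1, 2, \ldots, k$, and let $\mathbf{y}'\mathbf{A}\mathbf{y} = \sum_{i=1}^k \mathbf{y}'\mathbf{A}_i\mathbf{y}$, where $\mathbf{A} = \sum_{i=1}^k \mathbf{A}_i$ is symmetric of rank $r$. Then

(i) $\mathbf{y}'\mathbf{A}_i\mathbf{y}/\sigma^2$ is $\chi^2(r_i, \boldsymbol{\mu}'\mathbf{A}_i\boldsymbol{\mu}/2\sigma^2)$, $i = 1, 2, \ldots, k$.
(ii) $\mathbf{y}'\mathbf{A}_i\mathbf{y}$ and $\mathbf{y}'\mathbf{A}_j\mathbf{y}$ are independent for all $i \neq j$.
(iii) $\mathbf{y}'\mathbf{A}\mathbf{y}/\sigma^2$ is $\chi^2(r, \boldsymbol{\mu}'\mathbf{A}\boldsymbol{\mu}/2\sigma^2)$.

These results are obtained if and only if any two of the following three statements are true:

(a) Each $\mathbf{A}_i$ is idempotent.
(b) $\mathbf{A}_i\mathbf{A}_j = \mathbf{O}$ for all $i \neq j$.
(c) $\mathbf{A} = \sum_{i=1}^k \mathbf{A}_i$ is idempotent.

Or if and only if (c) and (d) are true, where (d) is the following statement:

(d) $r = \sum_{i=1}^k r_i$.

PROOF. See Searle (1971, pp. 61-64).

Note that by Theorem 2.13g, any two of (a), (b), or (c) implies the third.

Theorem 5.6c pertains to partitioning a sum of squares into several component sums of squares. The following corollary treats the special case where $\mathbf{A} = \mathbf{I}$; that is, the case of partitioning the total sum of squares $\mathbf{y}'\mathbf{y}$ into several sums of squares.

Corollary 1. Let $\mathbf{y}$ be $N_n(\boldsymbol{\mu}, \sigma^2\mathbf{I})$, let $\mathbf{A}_i$ be symmetric of rank $r_i$ for $i = 1, 2, \ldots, k$, and let $\mathbf{y}'\mathbf{y} = \sum_{i=1}^k \mathbf{y}'\mathbf{A}_i\mathbf{y}$. Then (i) each $\mathbf{y}'\mathbf{A}_i\mathbf{y}/\sigma^2$ is $\chi^2(r_i, \boldsymbol{\mu}'\mathbf{A}_i\boldsymbol{\mu}/2\sigma^2)$ and (ii) the $\mathbf{y}'\mathbf{A}_i\mathbf{y}$ terms are mutually independent if and only if any one of the following statements holds:

(a) Each $\mathbf{A}_i$ is idempotent.
(b) $\mathbf{A}_i\mathbf{A}_j = \mathbf{O}$ for all $i \neq j$.
(c) $n = \sum_{i=1}^k r_i$.

Note that by Theorem 2.13h, condition (c) implies the other two conditions. Cochran (1934) first proved a version of Corollary 1 to Theorem 5.6c.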


PROBLEMS 


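5.1 Show that $\sum_{i=1}^n (y_i - \bar{y})^2 = \sum_{i=1}^n y_i^2 - n\bar{y}^2$ as in (5.1).

5.2 Show that $(1/n)\mathbf{J}$ is idempotent, $\mathbf{I} - (1/n)\mathbf{J}$ is idempotent, and $[\mathbf{I} - (1/n)\mathbf{J}][(1/n)\mathbf{J}] = \mathbf{O}$, as noted in Section 5.1.

5.3 Obtain $\operatorname{var}(s^2)$ in the following two ways, where $s^2$ is defined in (5.6) as $s^2 = \sum_{i=1}^n (y_i - \bar{y})^2/(n - 1)$ and we assume that $\mathbf{y} = (y_1, y_2, \ldots, y_n)'$ is $N_n(\mu\mathbf{j}, \sigma^2\mathbf{I})$.

(a) Write $s^2$ as $s^2 = \mathbf{y}'[\mathbf{I} - (1/n)\mathbf{J}]\mathbf{y}/(n - 1)$ and use Theorem 5.2b.
(b) The function $u = (n - 1)s^2/\sigma^2$ is distributed as $\chi^2(n - 1)$, and therefore $\operatorname{var}(u) = 2(n - 1)$. Then $\operatorname{var}(s^2) = \operatorname{var}[\sigma^2 u/(n - 1)]$.

5.4 Show that

$$\frac{(\sqrt{2\pi})^p|\mathbf{V}|^{1/2} e^{-(\boldsymbol{\mu}'\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu} - \boldsymbol{\theta}'\mathbf{V}^{-1}\boldsymbol{\theta})/2}}{(\sqrt{2\pi})^p|\boldsymbol{\Sigma}|^{1/2}} = |\mathbf{I} - 2t\mathbf{A}\boldsymbol{\Sigma}|^{-1/2} e^{-\boldsymbol{\mu}'[\mathbf{I} - (\mathbf{I} - 2t\mathbf{A}\boldsymbol{\Sigma})^{-1}]\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}/2}$$

as in the proof of Theorem 5.2b, where $\boldsymbol{\theta}' = \boldsymbol{\mu}'(\mathbf{I} - 2t\mathbf{A}\boldsymbol{\Sigma})^{-1}$ and $\mathbf{V}^{-1} = (\mathbf{I} - 2t\mathbf{A}\boldsymbol{\Sigma})\boldsymbol{\Sigma}^{-1}$.

5.5 Show that

$$e^{-[\mathbf{y}'(\mathbf{I} - 2t\mathbf{A}\boldsymbol{\Sigma})\boldsymbol{\Sigma}^{-1}\mathbf{y} - 2\boldsymbol{\mu}'\boldsymbol{\Sigma}^{-1}\mathbf{y} + \boldsymbol{\mu}'\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}]/2} = e^{-(\boldsymbol{\mu}'\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu} - \boldsymbol{\theta}'\mathbf{V}^{-1}\boldsymbol{\theta})/2}\, e^{-(\mathbf{y} - \boldsymbol{\theta})'\mathbf{V}^{-1}(\mathbf{y} - \boldsymbol{\theta})/2}$$

as in the proof of Theorem 5.2b, where $\boldsymbol{\theta}' = \boldsymbol{\mu}'(\mathbf{I} - 2t\mathbf{A}\boldsymbol{\Sigma})^{-1}$ and $\mathbf{V}^{-1} = (\mathbf{I} - 2t\mathbf{A}\boldsymbol{\Sigma})\boldsymbol{\Sigma}^{-1}$.

5.6 Let $k(t) = -\frac{1}{2}\ln|\mathbf{C}| - \frac{1}{2}\boldsymbol{\mu}'(\mathbf{I} - \mathbf{C}^{-1})\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}$ as in the proof of Theorem 5.2c, where $\mathbf{C}$ is a nonsingular matrix. Derive $k''(t)$.

5.7 Show that $\mathbf{y}'\mathbf{A}\mathbf{y} - \boldsymbol{\mu}'\mathbf{A}\boldsymbol{\mu} = (\mathbf{y} - \boldsymbol{\mu})'\mathbf{A}(\mathbf{y} - \boldsymbol{\mu}) + 2(\mathbf{y} - \boldsymbol{\mu})'\mathbf{A}\boldsymbol{\mu}$ as in (5.12).

5.8 Obtain the three terms $\mathbf{0}$, $2\boldsymbol{\Sigma}\mathbf{A}\boldsymbol{\mu}$, and $\mathbf{0}$ in the proof of Theorem 5.2d.

5.9 Prove Corollary 1 to Theorem 5.2d.

5.10 Prove Theorem 5.2e.

5.11 (a) Show that $\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})$ in (5.15) is equal to $\sum_{i=1}^n x_i y_i - n\bar{x}\bar{y}$ in (5.16).
(b) Show that $\sum_{i=1}^n x_i y_i - n\bar{x}\bar{y} = \mathbf{x}'[\mathbf{I} - (1/n)\mathbf{J}]\mathbf{y}$, as in (5.16) in Example 5.2b.

5.12 Prove Theorem 5.3a.

5.13 If $v$ is $\chi^2(n, \lambda)$, use Theorem 5.2c to show that $\operatorname{var}(v) = 2n + 8\lambda$ as in (5.24).

5.14 If $v$ is $\chi^2(n, \lambda)$, use the moment generating function in (5.25) to find $E(v)$ and $\operatorname{var}(v)$. [Hint: Use $\ln[M_v(t)]$; then $d\ln[M_v(0)]/dt = E(v)$ and $d^2\ln[M_v(0)]/dt^2 = \operatorname{var}(v)$ (see Problem 4.8). The notation $d\ln[M_v(0)]/dt$ indicates that $d\ln[M_v(t)]/dt$ is evaluated at $t = 0$; the notation $d^2\ln[M_v(0)]/dt^2$ is defined similarly.]

5.15 Prove Theorem 5.3c.

5.16 (a) Show that if $t = z/\sqrt{u/p}$ is $t(p)$ as in (5.33), then $t^2$ is $F(1, p)$.
(b) Show that if $t = y/\sqrt{u/p}$ is $t(p, \mu)$ as in (5.34), then $t^2$ is $F(1, p, \frac{1}{2}\mu^2)$.

5.17 Show that $\boldsymbol{\Sigma}^{-1/2}(\mathbf{y} - \boldsymbol{\mu})$ is $N_n(\mathbf{0}, \mathbf{I})$, as used in the illustration at the beginning of Section 5.5.

5.18 (a) Prove Corollary 1 of Theorem 5.5.
(b) Prove Corollary 2 of Theorem 5.5.

5.19 If $\mathbf{y}$ is $N_n(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, verify that $(\mathbf{y} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{y} - \boldsymbol{\mu})$ is $\chi^2(n)$, as in (5.35), by using Theorem 5.5. What is the distribution of $\mathbf{y}'\boldsymbol{\Sigma}^{-1}\mathbf{y}$?

5.20 Prove the following additional corollaries to Theorem 5.5:

(a) If $\mathbf{y}$ is $N_p(\mathbf{0}, \boldsymbol{\Sigma})$, then $\mathbf{y}'\mathbf{A}\mathbf{y}$ is $\chi^2(r)$ if and only if $\mathbf{A}\boldsymbol{\Sigma}$ is idempotent of rank $r$.
(b) If $\mathbf{y}$ is $N_p(\boldsymbol{\mu}, \sigma^2\mathbf{I})$, then $\mathbf{y}'\mathbf{y}/\sigma^2$ is $\chi^2(p, \boldsymbol{\mu}'\boldsymbol{\mu}/2\sigma^2)$.
(c) If $\mathbf{y}$ is $N_p(\boldsymbol{\mu}, \mathbf{I})$, then $\mathbf{y}'\mathbf{A}\mathbf{y}$ is $\chi^2(r, \frac{1}{2}\boldsymbol{\mu}'\mathbf{A}\boldsymbol{\mu})$ if and only if $\mathbf{A}$ is idempotent of rank $r$.
(d) If $\mathbf{y}$ is $N_p(\boldsymbol{\mu}, \sigma^2\boldsymbol{\Sigma})$, then $\mathbf{y}'\mathbf{A}\mathbf{y}/\sigma^2$ is $\chi^2(r, \boldsymbol{\mu}'\mathbf{A}\boldsymbol{\mu}/2\sigma^2)$ if and only if $\mathbf{A}\boldsymbol{\Sigma}$ is idempotent of rank $r$.
(e) If $\mathbf{y}$ is $N_p(\boldsymbol{\mu}, \sigma^2\boldsymbol{\Sigma})$, then $\mathbf{y}'\boldsymbol{\Sigma}^{-1}\mathbf{y}/\sigma^2$ is $\chi^2(p, \boldsymbol{\mu}'\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}/2\sigma^2)$.

5.21 Prove Corollary 1 of Theorem 5.6a.

5.22 Show that $\mathbf{j}'[\mathbf{I} - (1/n)\mathbf{J}] = \mathbf{0}'$, as in Example 5.6a.

5.23 Prove Corollary 1 of Theorem 5.6b.

5.24 Suppose that $y_1, y_2, \ldots, y_n$ is a random sample from $N(\mu, \sigma^2)$ so that $\mathbf{y} = (y_1, y_2, \ldots, y_n)'$ is $N_n(\mu\mathbf{j}, \sigma^2\mathbf{I})$. It was shown in Example 5.5 that $(n - 1)s^2/\sigma^2 = \sum_{i=1}^n (y_i - \bar{y})^2/\sigma^2$ is $\chi^2(n - 1)$. In Example 5.6a, it was demonstrated that $\bar{y}$ and $s^2 = \sum_{i=1}^n (y_i - \bar{y})^2/(n - 1)$ are independent.

(a) Show that $\bar{y}$ is $N(\mu, \sigma^2/n)$.
(b) Show that $t = (\bar{y} - \mu)/(s/\sqrt{n})$ is distributed as $t(n - 1)$.
(c) Given $\mu_0 \neq \mu$, show that $t = (\bar{y} - \mu_0)/(s/\sqrt{n})$ is distributed as $t(n - 1, \delta)$. Find $\delta$.

5.25 Suppose that $\mathbf{y}$ is $N_n(\mu\mathbf{j}, \sigma^2\mathbf{I})$. Find the distribution of

$$\frac{n\bar{y}^2}{\sum_{i=1}^n (y_i - \bar{y})^2/(n - 1)}.$$

(This statistic could be used to test $H_0\colon \mu = 0$.)

5.26 Suppose that $\mathbf{y}$ is $N_n(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, where $\boldsymbol{\mu} = \mu\mathbf{j}$ and

$$\boldsymbol{\Sigma} = \sigma^2\begin{pmatrix} 1 & \rho & \cdots & \rho \\ \rho & 1 & \cdots & \rho \\ \vdots & \vdots & & \vdots \\ \rho & \rho & \cdots & 1 \end{pmatrix}.$$

Thus $E(y_i) = \mu$ for all $i$, $\operatorname{var}(y_i) = \sigma^2$ for all $i$, and $\operatorname{cov}(y_i, y_j) = \sigma^2\rho$ for all $i \neq j$; that is, the $y$'s are equicorrelated.

(a) Show that $\boldsymbol{\Sigma}$ can be written in the form $\boldsymbol{\Sigma} = \sigma^2[(1 - \rho)\mathbf{I} + \rho\mathbf{J}]$.
(b) Show that $\sum_{i=1}^n (y_i - \bar{y})^2/[\sigma^2(1 - \rho)]$ is $\chi^2(n - 1)$.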


5.27 Suppose that $\mathbf{y}$ is $N_3(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, where


2 1 
m=|{-l1], Y= 11 2 1 
3 1 3 
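Let

$$\mathbf{A} = \begin{pmatrix} 1 & -3 & -8 \\ -3 & 2 & -6 \\ -8 & -6 & 3 \end{pmatrix}.$$

(a) Find $E(\mathbf{y}'\mathbf{A}\mathbf{y})$.
(b) Find $\operatorname{var}(\mathbf{y}'\mathbf{A}\mathbf{y})$.
(c) Does $\mathbf{y}'\mathbf{A}\mathbf{y}$ have a chi-square distribution?
(d) If $\boldsymbol{\Sigma} = \sigma^2\mathbf{I}$, does $\mathbf{y}'\mathbf{A}\mathbf{y}/\sigma^2$ have a chi-square distribution?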


5.28 Assuming that $\mathbf{y}$ is $N_3(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, where


3 2 0 O 
Bh= <=) ’ >= 0 4 0 > 
1 0 0 3 


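find a symmetric matrix $\mathbf{A}$ such that $\mathbf{y}'\mathbf{A}\mathbf{y}$ is $\chi^2(3, \frac{1}{2}\boldsymbol{\mu}'\mathbf{A}\boldsymbol{\mu})$. What is $\lambda = \frac{1}{2}\boldsymbol{\mu}'\mathbf{A}\boldsymbol{\mu}$?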


5.29 


5.30 


5.31 


5.32 


Assuming that $\mathbf{y}$ is $N_4(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, where
3 1 0O 0 0 
(4) x-(82 os 
4 0 O —-4 6 


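find a matrix $\mathbf{A}$ such that $\mathbf{y}'\mathbf{A}\mathbf{y}$ is $\chi^2(4, \frac{1}{2}\boldsymbol{\mu}'\mathbf{A}\boldsymbol{\mu})$. What is $\lambda = \frac{1}{2}\boldsymbol{\mu}'\mathbf{A}\boldsymbol{\mu}$?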


Suppose that $\mathbf{y}$ is $N_3(\boldsymbol{\mu}, \sigma^2\mathbf{I})$ and let


> 


3 if 2-1 -1 : 
p= —2 A=-{-1 2 -1], B= ( 
1 et tts “2 


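(a) What is the distribution of $\mathbf{y}'\mathbf{A}\mathbf{y}/\sigma^2$?
(b) Are $\mathbf{y}'\mathbf{A}\mathbf{y}$ and $\mathbf{B}\mathbf{y}$ independent?
(c) Are $\mathbf{y}'\mathbf{A}\mathbf{y}$ and $y_1 + y_2 + y_3$ independent?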


Suppose that $\mathbf{y}$ is $N_3(\boldsymbol{\mu}, \sigma^2\mathbf{I})$, where $\boldsymbol{\mu} = (1, 2, 3)'$, and let


ae | 
Bet 4 a a 
t va 


(a) What is the distribution of $\mathbf{y}'\mathbf{B}\mathbf{y}/\sigma^2$?


1 
0 


1 
—1 


). 




(b) Is $\mathbf{y}'\mathbf{B}\mathbf{y}$ independent of $\mathbf{y}'\mathbf{A}\mathbf{y}$, where $\mathbf{A}$ is as defined in Problem 5.30?


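Suppose that $\mathbf{y}$ is $N_n(\boldsymbol{\mu}, \sigma^2\mathbf{I})$ and that $\mathbf{X}$ is an $n \times p$ matrix of constants with rank $p < n$.

(a) Show that $\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$ and $\mathbf{I} - \mathbf{H} = \mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$ are idempotent, and find the rank of each.
(b) Assuming $\boldsymbol{\mu}$ is a linear combination of the columns of $\mathbf{X}$, that is, $\boldsymbol{\mu} = \mathbf{X}\mathbf{b}$ for some $\mathbf{b}$ [see (2.37)], find $E(\mathbf{y}'\mathbf{H}\mathbf{y})$ and $E[\mathbf{y}'(\mathbf{I} - \mathbf{H})\mathbf{y}]$, where $\mathbf{H}$ is as defined in part (a).
(c) Find the distributions of $\mathbf{y}'\mathbf{H}\mathbf{y}/\sigma^2$ and $\mathbf{y}'(\mathbf{I} - \mathbf{H})\mathbf{y}/\sigma^2$.
(d) Show that $\mathbf{y}'\mathbf{H}\mathbf{y}$ and $\mathbf{y}'(\mathbf{I} - \mathbf{H})\mathbf{y}$ are independent.
(e) Find the distribution of

$$\frac{\mathbf{y}'\mathbf{H}\mathbf{y}/p}{\mathbf{y}'(\mathbf{I} - \mathbf{H})\mathbf{y}/(n - p)}.$$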

6 Simple Linear Regression 


6.1 THE MODEL 


By (1.1), the simple linear regression model for $n$ observations can be written as

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, 2, \ldots, n. \tag{6.1}$$

The designation simple indicates that there is only one $x$ to predict the response $y$, and linear means that the model (6.1) is linear in $\beta_0$ and $\beta_1$. [Actually, it is the assumption $E(y_i) = \beta_0 + \beta_1 x_i$ that is linear; see assumption 1 below.] For example, a model such as $y_i = \beta_0 + \beta_1 x_i^2 + \varepsilon_i$ is linear in $\beta_0$ and $\beta_1$, whereas the model $y_i = \beta_0 + e^{\beta_1 x_i} + \varepsilon_i$ is not linear.

In this chapter, we assume that $y_i$ and $\varepsilon_i$ are random variables and that the values of $x_i$ are known constants, which means that the same values of $x_1, x_2, \ldots, x_n$ would be used in repeated sampling. The case in which the $x$ variables are random variables is treated in Chapter 10.

To complete the model in (6.1), we make the following additional assumptions:

1. $E(\varepsilon_i) = 0$ for all $i = 1, 2, \ldots, n$, or, equivalently, $E(y_i) = \beta_0 + \beta_1 x_i$.
2. $\operatorname{var}(\varepsilon_i) = \sigma^2$ for all $i = 1, 2, \ldots, n$, or, equivalently, $\operatorname{var}(y_i) = \sigma^2$.
3. $\operatorname{cov}(\varepsilon_i, \varepsilon_j) = 0$ for all $i \neq j$, or, equivalently, $\operatorname{cov}(y_i, y_j) = 0$.

Assumption 1 states that the model (6.1) is correct, implying that $y_i$ depends only on $x_i$ and that all other variation in $y_i$ is random. Assumption 2 asserts that the variance of $\varepsilon$ or $y$ does not depend on the values of $x_i$. (Assumption 2 is also known as the assumption of homoscedasticity, homogeneous variance, or constant variance.) Under assumption 3, the $\varepsilon$ variables (or the $y$ variables) are uncorrelated with each other. In Section 6.3, we will add a normality assumption, and the $y$ (or the $\varepsilon$) variables will thereby be independent as well as uncorrelated. Each assumption has been stated in terms of the $\varepsilon$'s or the $y$'s. For example, if $\operatorname{var}(\varepsilon_i) = \sigma^2$, then

$$\operatorname{var}(y_i) = E[y_i - E(y_i)]^2 = E(y_i - \beta_0 - \beta_1 x_i)^2 = E(\varepsilon_i^2) = \sigma^2.$$




Any of these assumptions may fail to hold with real data. A plot of the data will often reveal departures from assumptions 1 and 2 (and to a lesser extent assumption 3). Techniques for checking on the assumptions are discussed in Chapter 9.


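6.2 ESTIMATION OF $\beta_0$, $\beta_1$, AND $\sigma^2$


Using a random sample of $n$ observations $y_1, y_2, \ldots, y_n$ and the accompanying fixed values $x_1, x_2, \ldots, x_n$, we can estimate the parameters $\beta_0$, $\beta_1$, and $\sigma^2$. To obtain the estimates $\hat{\beta}_0$ and $\hat{\beta}_1$, we use the method of least squares, which does not require any distributional assumptions (for maximum likelihood estimators based on normality, see Section 7.6.2).

In the least-squares approach, we seek estimators $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize the sum of squares of the deviations $y_i - \hat{y}_i$ of the $n$ observed $y_i$'s from their predicted values $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$:

$$\hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}} = \sum_{i=1}^n \hat{\varepsilon}_i^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2. \tag{6.2}$$

Note that the predicted value $\hat{y}_i$ estimates $E(y_i)$, not $y_i$; that is, $\hat{\beta}_0 + \hat{\beta}_1 x_i$ estimates $\beta_0 + \beta_1 x_i$, not $\beta_0 + \beta_1 x_i + \varepsilon_i$. A better notation would be $\widehat{E(y_i)}$, but $\hat{y}_i$ is commonly used.

To find the values of $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize $\hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}}$ in (6.2), we differentiate with respect to $\hat{\beta}_0$ and $\hat{\beta}_1$ and set the results equal to 0:

$$\frac{\partial\hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}}}{\partial\hat{\beta}_0} = -2\sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0, \tag{6.3}$$

$$\frac{\partial\hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}}}{\partial\hat{\beta}_1} = -2\sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)x_i = 0. \tag{6.4}$$

The solution to (6.3) and (6.4) is given by

$$\hat{\beta}_1 = \frac{\sum_{i=1}^n x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^n x_i^2 - n\bar{x}^2} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}, \tag{6.5}$$

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}. \tag{6.6}$$

To verify that $\hat{\beta}_0$ and $\hat{\beta}_1$ in (6.5) and (6.6) minimize $\hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}}$ in (6.2), we can examine the second derivatives or simply observe that $\hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}}$ has no maximum and therefore the first derivatives yield a minimum. For an algebraic proof that $\hat{\beta}_0$ and $\hat{\beta}_1$ minimize (6.2), see (7.10) in Section 7.3.1.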


Example 6.2. Students in a statistics class (taught by one of the authors) claimed that doing the homework had not helped prepare them for the midterm exam. The exam score $y$ and homework score $x$ (averaged up to the time of the midterm) for the 18 students in the class were as follows:

 y    x      y    x      y    x
95   96     72   89     35    0
80   77     66   47     50   30
 0    0     98   90     72   59
 0    0     90   93     55   77
79   78      0   18     75   74
77   64     95   86     66   67

Using (6.5) and (6.6), we obtain

$$\hat{\beta}_1 = \frac{\sum_{i=1}^n x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^n x_i^2 - n\bar{x}^2} = \frac{81{,}195 - 18(58.056)(61.389)}{80{,}199 - 18(58.056)^2} = .8726,$$

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x} = 61.389 - .8726(58.056) = 10.73.$$

The prediction equation is thus given by

$$\hat{y} = 10.73 + .8726x.$$

This equation and the 18 points are plotted in Figure 6.1. It is readily apparent in the plot that the slope $\hat{\beta}_1$ is the rate of change of $\hat{y}$ as $x$ varies and that the intercept $\hat{\beta}_0$ is the value of $\hat{y}$ at $x = 0$.

The apparent linear trend in Figure 6.1 does not establish cause and effect between homework and test results (for inferences that can be drawn, see Section 6.3). The assumption $\operatorname{var}(\varepsilon_i) = \sigma^2$ (constant variance) for all $i = 1, 2, \ldots, 18$ appears to be reasonable.
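The arithmetic in Example 6.2 can be reproduced directly from (6.5) and (6.6). The sketch below enters the 18 $(y, x)$ pairs from the table (read down each pair of columns) and uses only plain Python; the variable names are illustrative.

```python
# Least-squares fit of exam score y on homework score x, using (6.5) and (6.6).
y = [95, 80, 0, 0, 79, 77, 72, 66, 98, 90, 0, 95, 35, 50, 72, 55, 75, 66]
x = [96, 77, 0, 0, 78, 64, 89, 47, 90, 93, 18, 86, 0, 30, 59, 77, 74, 67]

n = len(y)
xbar = sum(x) / n
ybar = sum(y) / n
sum_xy = sum(xi * yi for xi, yi in zip(x, y))   # sum of x_i * y_i
sum_xx = sum(xi * xi for xi in x)               # sum of x_i^2

b1 = (sum_xy - n * xbar * ybar) / (sum_xx - n * xbar ** 2)   # slope, (6.5)
b0 = ybar - b1 * xbar                                        # intercept, (6.6)

print(sum_xy, sum_xx)              # 81195 80199
print(round(b1, 4), round(b0, 2))  # 0.8726 10.73
```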


Note that the three assumptions in Section 6.1 were not used in deriving the least-squares estimators $\hat{\beta}_0$ and $\hat{\beta}_1$ in (6.5) and (6.6). It is not necessary that $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ be based on $E(y_i) = \beta_0 + \beta_1 x_i$; that is, $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ can be fit to a set of data for which $E(y_i) \neq \beta_0 + \beta_1 x_i$. This is illustrated in Figure 6.2, where a straight line has been fitted to curved data.


Figure 6.1 Regression line and data for homework and test scores (horizontal axis $x$, vertical axis $y$, both running from 0 to 100).


However, if the three assumptions in Section 6.1 hold, then the least-squares esti- 
mators Bo and Bi are unbiased and have minimum variance among all linear unbiased 
estimators (for the minimum variance property, see Theorem 7.3d in Section 7.3.2; 


note that Bo and By are linear functions of y,y2,...,y,). Using the three 


.. x 


Figure 6.2 A straight line fitted to data with a curved trend. 


6.2 ESTIMATION OF Bp, Bi, AND o° 131 


assumptions, we obtain the following means and variances of Bo and Bi: 


E(B,) = By (6.7) 
E(Bo) = Bo (6.8) 
var(B;) = ae (6.9) 
VS i — xP 
mae = (6.10) 
: 1 See Gay 


Note that in discussing E(B) and var(B,), for example, we are considering 
random variation of Bi from sample to sample. It is assumed that the n values x, 
X2,...,X, would remain the same in future samples so that var(,) and var(By) 
are constant. 

In (6.9), we see that $\mathrm{var}(\hat{\beta}_1)$ is minimized when $\sum_{i=1}^n (x_i - \bar{x})^2$ is maximized. If
the $x_i$ values have the range $a \le x_i \le b$, then $\sum_{i=1}^n (x_i - \bar{x})^2$ is maximized if half
the $x$'s are selected equal to $a$ and half equal to $b$ (assuming that $n$ is even; see
Problem 6.4). In (6.10), it is clear that $\mathrm{var}(\hat{\beta}_0)$ is minimized when $\bar{x} = 0$.
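The design remark about (6.9) can be checked numerically. The following is a minimal sketch, not part of the original text: the range a = 0, b = 10 and the sample size n = 6 are hypothetical values chosen only for illustration. It compares the sum of squares $\sum (x_i - \bar{x})^2$ for an endpoint design with that of an equally spaced design.

```python
def sxx(xs):
    """Sum of squared deviations of xs about their mean."""
    xbar = sum(xs) / len(xs)
    return sum((x - xbar) ** 2 for x in xs)

# Hypothetical range [a, b] and (even) sample size n, chosen for illustration.
a, b, n = 0.0, 10.0, 6

endpoints = [a] * (n // 2) + [b] * (n // 2)                 # half at a, half at b
even_grid = [a + i * (b - a) / (n - 1) for i in range(n)]   # equally spaced

sxx_endpoints = sxx(endpoints)
sxx_grid = sxx(even_grid)

# By (6.9), var(beta1_hat) = sigma^2 / sxx, so a larger sxx gives a smaller variance.
print("endpoint design:", sxx_endpoints)
print("equally spaced :", sxx_grid)
```

Under these assumed values the endpoint design gives the larger sum of squares, and hence the smaller variance for the slope estimator, in agreement with the statement above.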

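The method of least squares does not yield an estimator of $\mathrm{var}(y_i) = \sigma^2$; minimization
of $\hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}}$ yields only $\hat{\beta}_0$ and $\hat{\beta}_1$. To estimate $\sigma^2$, we use the definition in (3.6),
$\sigma^2 = E[y_i - E(y_i)]^2$. By assumption 2 in Section 6.1, $\sigma^2$ is the same for each
$y_i$, $i = 1, 2, \ldots, n$. Using $\hat{y}_i$ as an estimator of $E(y_i)$, we estimate $\sigma^2$ by an average
from the sample, that is

$$s^2 = \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{n-2} = \frac{\sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2}{n-2} = \frac{\mathrm{SSE}}{n-2}, \tag{6.11}$$

where $\hat{\beta}_0$ and $\hat{\beta}_1$ are given by (6.5) and (6.6) and $\mathrm{SSE} = \sum_{i=1}^n (y_i - \hat{y}_i)^2$. The deviation
$\hat{\varepsilon}_i = y_i - \hat{y}_i$ is often called the residual of $y_i$, and SSE is called the residual sum of
squares or error sum of squares. With $n-2$ in the denominator, $s^2$ is an unbiased
estimator of $\sigma^2$: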


$$E(s^2) = \frac{E(\mathrm{SSE})}{n-2} = \frac{(n-2)\sigma^2}{n-2} = \sigma^2. \tag{6.12}$$


Intuitively, we divide by $n-2$ in (6.11) instead of $n-1$ as in
$s^2 = \sum_i (y_i - \bar{y})^2/(n-1)$ in (5.6), because $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ has two estimated
parameters and should thereby be a better estimator of $E(y_i)$ than $\bar{y}$. Thus we
expect $\mathrm{SSE} = \sum_{i=1}^n (y_i - \hat{y}_i)^2$ to be less than $\sum_{i=1}^n (y_i - \bar{y})^2$. In fact, using (6.5) and (6.6),
we can write the numerator of (6.11) in the form

$$\mathrm{SSE} = \sum_{i=1}^n (y_i - \bar{y})^2 - \frac{\left[\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})\right]^2}{\sum_{i=1}^n (x_i - \bar{x})^2}, \tag{6.13}$$

which shows that $\sum_i (y_i - \hat{y}_i)^2$ is indeed smaller than $\sum_i (y_i - \bar{y})^2$.
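The identity in (6.13) is easy to confirm on a small data set. This is a minimal sketch under stated assumptions: the five (x, y) pairs below are made up for illustration and do not come from the text. The slope and intercept use the usual least-squares formulas, and SSE is computed both directly from the residuals and from the right side of (6.13).

```python
# Hypothetical data, for illustration only (not from the text).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

xbar = sum(x) / n
ybar = sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

b1 = sxy / sxx            # least-squares slope
b0 = ybar - b1 * xbar     # least-squares intercept

# SSE from the residuals, as in (6.11)
sse_direct = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
# SSE from the right side of (6.13)
sse_identity = syy - sxy ** 2 / sxx

s2 = sse_direct / (n - 2)  # estimator of sigma^2 with n - 2 in the denominator
print(sse_direct, sse_identity, s2)
```

The two SSE values agree up to rounding error, and both are no larger than the total sum of squares `syy`.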


6.3 HYPOTHESIS TEST AND CONFIDENCE INTERVAL FOR $\beta_1$


Typically, hypotheses about $\beta_1$ are of more interest than hypotheses about $\beta_0$, since
our first priority is to determine whether there is a linear relationship between $y$ and $x$.
(See Problem 6.9 for a test and confidence interval for $\beta_0$.) In this section, we consider
the hypothesis $H_0\colon \beta_1 = 0$, which states that there is no linear relationship
between $y$ and $x$ in the model $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$. The hypothesis $H_0\colon \beta_1 = c$ (for
$c \neq 0$) is of less interest.

In order to obtain a test for $H_0\colon \beta_1 = 0$, we assume that $y_i$ is $N(\beta_0 + \beta_1 x_i, \sigma^2)$.
Then $\hat{\beta}_1$ and $s^2$ have the following properties (these are special cases of results
established in Theorem 7.6b in Section 7.6.3):

1. $\hat{\beta}_1$ is $N\left[\beta_1, \sigma^2/\sum_i (x_i - \bar{x})^2\right]$.
2. $(n-2)s^2/\sigma^2$ is $\chi^2(n-2)$.
3. $\hat{\beta}_1$ and $s^2$ are independent.

From these three properties it follows by (5.29) that

$$t = \frac{\hat{\beta}_1}{s\big/\sqrt{\sum_i (x_i - \bar{x})^2}} \tag{6.14}$$

is distributed as $t(n-2, \delta)$, the noncentral $t$ with noncentrality parameter $\delta$.
By a comment following (5.29), $\delta$ is given by
$\delta = E(\hat{\beta}_1)\big/\sqrt{\mathrm{var}(\hat{\beta}_1)} = \beta_1\big/\left[\sigma\big/\sqrt{\sum_i (x_i - \bar{x})^2}\right]$.
If $\beta_1 = 0$, then by (5.28), $t$ is distributed as $t(n-2)$. For
a two-sided alternative hypothesis $H_1\colon \beta_1 \neq 0$, we reject $H_0\colon \beta_1 = 0$ if
$|t| \ge t_{\alpha/2,\,n-2}$, where $t_{\alpha/2,\,n-2}$ is the upper $\alpha/2$ percentage point of the central $t$
distribution and $\alpha$ is the desired significance level of the test (probability of rejecting $H_0$
when it is true). Alternatively, we reject $H_0$ if $p \le \alpha$, where $p$ is the $p$ value. For a
two-sided test, the $p$ value is defined as twice the probability that $t(n-2)$ exceeds the
absolute value of the observed $t$.


A $100(1 - \alpha)\%$ confidence interval for $\beta_1$ is given by

$$\hat{\beta}_1 \pm t_{\alpha/2,\,n-2}\,\frac{s}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2}}. \tag{6.15}$$


Confidence intervals are defined and discussed further in Section 8.6. A confidence 
interval for E(y) and a prediction interval for y are also given in Section 8.6. 


Example 6.3. We test the hypothesis $H_0\colon \beta_1 = 0$ for the grades data in Example 6.2.
By (6.14), the $t$ statistic is

$$t = \frac{\hat{\beta}_1}{s\big/\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2}} = \frac{.8726}{(13.8547)/(139.753)} = 8.8025.$$

Since $t = 8.8025 > t_{.025,\,16} = 2.120$, we reject $H_0\colon \beta_1 = 0$ at the $\alpha = .05$ level of
significance. Alternatively, the $p$ value is $1.571 \times 10^{-7}$, which is less than .05.
A 95% confidence interval for $\beta_1$ is given by (6.15) as

$$\hat{\beta}_1 \pm t_{.025,\,16}\,\frac{s}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2}},$$

$$.8726 \pm 2.120(.09914),$$

$$.8726 \pm .2102,$$

$$(.6624,\ 1.0828).$$
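The arithmetic of Example 6.3 can be retraced in a few lines. This is a sketch only: it starts from the rounded summary values quoted in the example (the slope .8726, and $s = 13.8547$ and $\sqrt{\sum (x_i - \bar{x})^2} = 139.753$ as read from the example's display), and it takes the critical value 2.120 from the text rather than computing it. Because the inputs are rounded, the last digits can differ slightly from the printed 8.8025.

```python
# Rounded summary values as read from Example 6.3 (assumed inputs).
b1 = 0.8726          # estimated slope
s = 13.8547          # residual standard deviation
root_sxx = 139.753   # square root of sum (x_i - xbar)^2
t_crit = 2.120       # t_{.025, 16}, quoted from the text

se_b1 = s / root_sxx                 # estimated standard error of the slope
t_stat = b1 / se_b1                  # test statistic (6.14)
half_width = t_crit * se_b1
ci = (b1 - half_width, b1 + half_width)   # interval (6.15)
reject = abs(t_stat) >= t_crit

print(round(se_b1, 5), round(t_stat, 2), reject)
print(tuple(round(v, 4) for v in ci))
```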


6.4 COEFFICIENT OF DETERMINATION 


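The coefficient of determination $r^2$ is defined as

$$r^2 = \frac{\mathrm{SSR}}{\mathrm{SST}} = \frac{\sum_{i=1}^n (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^n (y_i - \bar{y})^2}, \tag{6.16}$$

where $\mathrm{SSR} = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2$ is the regression sum of squares and $\mathrm{SST} = \sum_{i=1}^n (y_i - \bar{y})^2$
is the total sum of squares. The total sum of squares can be partitioned into SST =
SSR + SSE, that is,

$$\sum_{i=1}^n (y_i - \bar{y})^2 = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^n (y_i - \hat{y}_i)^2. \tag{6.17}$$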


Thus $r^2$ in (6.16) gives the proportion of variation in $y$ that is explained by the
model or, equivalently, accounted for by regression on $x$.

We have labeled (6.16) as $r^2$ because it is the same as the square of the sample
correlation coefficient $r$ between $y$ and $x$

$$r = \frac{s_{xy}}{\sqrt{s_x^2 s_y^2}} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\left[\sum_{i=1}^n (x_i - \bar{x})^2\right]\left[\sum_{i=1}^n (y_i - \bar{y})^2\right]}}, \tag{6.18}$$

where $s_{xy}$ is given by (5.15) (see Problem 6.11). When $x$ is a random variable, $r$
estimates the population correlation in (3.19). The coefficient of determination $r^2$ is
discussed further in Sections 7.7, 10.4, and 10.5.


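Example 6.4. For the grades data of Example 6.2, we have

$$r^2 = \frac{\mathrm{SSR}}{\mathrm{SST}} = \frac{14{,}873.0}{17{,}944.3} = .8288.$$

The correlation between homework score and exam score is $r = \sqrt{.8288} = .910$.
The $t$ statistic in (6.14) can be expressed in terms of $r$ as follows:

$$t = \frac{\hat{\beta}_1}{s\big/\sqrt{\sum_i (x_i - \bar{x})^2}} \tag{6.19}$$

$$\phantom{t} = \frac{\sqrt{n-2}\; r}{\sqrt{1 - r^2}}. \tag{6.20}$$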


If $H_0\colon \beta_1 = 0$ is true, then, as noted following (6.14), the statistic in (6.19) is
distributed as $t(n-2)$ under the assumption that the $x_i$'s are fixed and the $y_i$'s are
independently distributed as $N(\beta_0 + \beta_1 x_i, \sigma^2)$. If $x$ is a random variable such that $x$ and $y$
have a bivariate normal distribution, then $t = \sqrt{n-2}\, r/\sqrt{1 - r^2}$ in (6.20) also has
the $t(n-2)$ distribution provided that $H_0\colon \rho = 0$ is true, where $\rho$ is the population
correlation coefficient defined in (3.19) (see Theorem 10.5). However, (6.19) and (6.20)
have different distributions if $H_0\colon \beta_1 = 0$ and $H_0\colon \rho = 0$ are false (see Section 10.4).
If $\beta_1 \neq 0$, then (6.19) has a noncentral $t$ distribution, but if $\rho \neq 0$, (6.20) does not
have a noncentral $t$ distribution.
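The agreement between (6.19) and (6.20) can be seen on the grades data using only the rounded values quoted above. This is a minimal sketch, assuming $n = 18$ and $r^2 = .8288$ as given in the text; because the inputs are rounded, the result matches the $t$ of Example 6.3 only to about two decimal places.

```python
import math

n = 18          # number of observations in the grades data
r2 = 0.8288     # coefficient of determination from Example 6.4

r = math.sqrt(r2)
# Form (6.20) of the t statistic, written in terms of r
t_from_r = math.sqrt(n - 2) * r / math.sqrt(1 - r2)

print(round(r, 3), round(t_from_r, 2))
```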


PROBLEMS 


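6.1 Obtain the least-squares solutions (6.5) and (6.6) from (6.3) and (6.4).

6.2 (a) Show that $E(\hat{\beta}_1) = \beta_1$ as in (6.7).
(b) Show that $E(\hat{\beta}_0) = \beta_0$ as in (6.8).

6.3 (a) Show that $\mathrm{var}(\hat{\beta}_1) = \sigma^2/\sum_{i=1}^n (x_i - \bar{x})^2$ as in (6.9).
(b) Show that $\mathrm{var}(\hat{\beta}_0) = \sigma^2\left[1/n + \bar{x}^2/\sum_{i=1}^n (x_i - \bar{x})^2\right]$ as in (6.10).

6.4 Suppose that $n$ is even and the $n$ values of $x_i$ can be selected anywhere in the
interval from $a$ to $b$. Show that $\mathrm{var}(\hat{\beta}_1)$ is a minimum if $n/2$ values of $x_i$ are
equal to $a$ and $n/2$ values are equal to $b$.

6.5 Show that $\mathrm{SSE} = \sum_{i=1}^n (y_i - \hat{y}_i)^2$ in (6.11) can be expressed in the form given
in (6.13).

6.6 Show that $E(s^2) = \sigma^2$ as in (6.12).

6.7 Show that $t = \hat{\beta}_1\big/\left[s\big/\sqrt{\sum_i (x_i - \bar{x})^2}\right]$ in (6.14) is distributed as $t(n-2, \delta)$,
where $\delta = \beta_1\big/\left[\sigma\big/\sqrt{\sum_i (x_i - \bar{x})^2}\right]$.

6.8 Obtain a test for $H_0\colon \beta_1 = c$ versus $H_1\colon \beta_1 \neq c$.

6.9 (a) Obtain a test for $H_0\colon \beta_0 = a$ versus $H_1\colon \beta_0 \neq a$.
(b) Obtain a confidence interval for $\beta_0$.

6.10 Show that $\sum_{i=1}^n (y_i - \bar{y})^2 = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^n (y_i - \hat{y}_i)^2$ as in (6.17).

6.11 Show that $r^2$ in (6.16) is the square of the correlation

$$r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\left[\sum_{i=1}^n (x_i - \bar{x})^2\right]\left[\sum_{i=1}^n (y_i - \bar{y})^2\right]}}$$

as given by (6.18).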


TABLE 6.1 Eruptions of Old Faithful Geyser, August 1-4, 1978 (a)

 y    x      y    x      y    x      y    x
78   4.4    80   4.3    76   4.5    75   4.0
74   3.9    56   1.7    82   3.9    73   3.7
68   4.0    80   3.9    84   4.3    67   3.7
76   4.0    69   3.7    53   2.5    68   4.3
80   3.5    57   3.1    86   3.8    86   3.6
84   4.1    90   4.0    51   1.9    72   3.8
50   2.3    42   1.8    85   4.6    75   3.8
93   4.7    91   4.1    45   1.8    75   3.8
55   1.7    51   1.8    88   4.7    66   2.5
76   4.9    79   3.2    51   1.8    84   4.5
58   1.7    33   1.9    80   4.6    70   4.1
74   4.6    82   4.6    49   1.9    79   3.7
75   3.4    51   2.0    82   3.5    60   3.8
 -    -      -    -      -    -     86   3.4

(a) Where x = duration, y = interval (both in minutes).


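6.12 Show that $r = \cos\theta$, where $\theta$ is the angle between the vectors $\mathbf{x} - \bar{x}\mathbf{j}$ and
$\mathbf{y} - \bar{y}\mathbf{j}$, where $\mathbf{x} - \bar{x}\mathbf{j} = (x_1 - \bar{x}, x_2 - \bar{x}, \ldots, x_n - \bar{x})'$ and
$\mathbf{y} - \bar{y}\mathbf{j} = (y_1 - \bar{y}, y_2 - \bar{y}, \ldots, y_n - \bar{y})'$.

6.13 Show that $t = \hat{\beta}_1\big/\left[s\big/\sqrt{\sum_i (x_i - \bar{x})^2}\right]$ in (6.19) is equal to
$t = \sqrt{n-2}\, r/\sqrt{1 - r^2}$ in (6.20).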

6.14 Table 6.1 (Weisberg 1985, p. 231) gives the data on daytime eruptions of Old
Faithful Geyser in Yellowstone National Park during August 1-4, 1978. The
variables are $x$ = duration of an eruption and $y$ = interval to the next eruption.
Can $x$ be used to successfully predict $y$ using a simple linear model
$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$?

(a) Find $\hat{\beta}_0$ and $\hat{\beta}_1$.
(b) Test $H_0\colon \beta_1 = 0$ using (6.14).
(c) Find a confidence interval for $\beta_1$.
(d) Find $r^2$ using (6.16).


7 Multiple Regression: Estimation 


7.1 INTRODUCTION


In multiple regression, we attempt to predict a dependent or response variable $y$ on
the basis of an assumed linear relationship with several independent or predictor
variables $x_1, x_2, \ldots, x_k$. In addition to constructing a model for prediction, we may wish
to assess the extent of the relationship between $y$ and the $x$ variables. For this purpose,
we use the multiple correlation coefficient $R$ (Section 7.7).

In this chapter, y is a continuous random variable and the x variables are fixed con- 
stants (either discrete or continuous) that are controlled by the experimenter. The case 
in which the x variables are random variables is covered in Chapter 10. In
analysis-of-variance (Chapters 12-15), the x variables are fixed and discrete.

Useful applied expositions of multiple regression for the fixed-x case can be found 
in Morrison (1983), Myers (1990), Montgomery and Peck (1992), Graybill and Iyer 
(1994), Mendenhall and Sincich (1996), Ryan (1997), Draper and Smith (1998), and 
Kutner et al. (2005). Theoretical treatments are given by Searle (1971), Graybill 
(1976), Guttman (1982), Kshirsagar (1983), Myers and Milton (1991), Jorgensen 
(1993), Wang and Chow (1994), Christensen (1996), Seber and Lee (2003), and 
Hocking (1976, 1985, 2003). 


7.2 THE MODEL 


The multiple linear regression model, as introduced in Section 1.2, can be
expressed as

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon. \tag{7.1}$$

We discuss estimation of the $\beta$ parameters when the model is linear in the $\beta$'s. An
example of a model that is linear in the $\beta$'s but not the $x$'s is the second-order
response surface model

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_2^2 + \beta_5 x_1 x_2 + \varepsilon. \tag{7.2}$$


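To estimate the $\beta$'s in (7.1), we will use a sample of $n$ observations on $y$ and the
associated $x$ variables. The model for the $i$th observation is

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \varepsilon_i, \quad i = 1, 2, \ldots, n. \tag{7.3}$$

The assumptions for $\varepsilon_i$ or $y_i$ are essentially the same as those for simple linear
regression in Section 6.1:

1. $E(\varepsilon_i) = 0$ for $i = 1, 2, \ldots, n$, or, equivalently, $E(y_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik}$.

2. $\mathrm{var}(\varepsilon_i) = \sigma^2$ for $i = 1, 2, \ldots, n$, or, equivalently, $\mathrm{var}(y_i) = \sigma^2$.

3. $\mathrm{cov}(\varepsilon_i, \varepsilon_j) = 0$ for all $i \neq j$, or, equivalently, $\mathrm{cov}(y_i, y_j) = 0$.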


Assumption 1 states that the model is correct, in other words that all relevant x's are
included and the model is indeed linear. Assumption 2 asserts that the variance of y
is constant and therefore does not depend on the x's. Assumption 3 states that the y's
are uncorrelated with each other, which usually holds in a random sample (the
observations would typically be correlated in a time series or when repeated
measurements are made on a single plant or animal). Later we will add a normality
assumption (Section 7.6), under which the y variables will be independent as well as
uncorrelated.

When all three assumptions hold, the least-squares estimators of the B’s have some 
good properties (Section 7.3.2). If one or more assumptions do not hold, the estima- 
tors may be poor. Under the normality assumption (Section 7.6), the maximum like- 
lihood estimators have excellent properties. 

Any of the three assumptions may fail to hold with real data. Several procedures 
have been devised for checking the assumptions. These diagnostic techniques are 
discussed in Chapter 9. 

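Writing (7.3) for each of the $n$ observations, we have

$$\begin{aligned}
y_1 &= \beta_0 + \beta_1 x_{11} + \beta_2 x_{12} + \cdots + \beta_k x_{1k} + \varepsilon_1 \\
y_2 &= \beta_0 + \beta_1 x_{21} + \beta_2 x_{22} + \cdots + \beta_k x_{2k} + \varepsilon_2 \\
&\;\;\vdots \\
y_n &= \beta_0 + \beta_1 x_{n1} + \beta_2 x_{n2} + \cdots + \beta_k x_{nk} + \varepsilon_n.
\end{aligned}$$

These $n$ equations can be written in matrix form as

$$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} =
\begin{pmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1k} \\
1 & x_{21} & x_{22} & \cdots & x_{2k} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_{n1} & x_{n2} & \cdots & x_{nk}
\end{pmatrix}
\begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix} +
\begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}$$

or

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}. \tag{7.4}$$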


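The preceding three assumptions on $\varepsilon_i$ or $y_i$ can be expressed in terms of the model in
(7.4):

1. $E(\boldsymbol{\varepsilon}) = \mathbf{0}$ or $E(\mathbf{y}) = \mathbf{X}\boldsymbol{\beta}$.
2. $\mathrm{cov}(\boldsymbol{\varepsilon}) = \sigma^2\mathbf{I}$ or $\mathrm{cov}(\mathbf{y}) = \sigma^2\mathbf{I}$.

Note that the assumption $\mathrm{cov}(\boldsymbol{\varepsilon}) = \sigma^2\mathbf{I}$ includes both the previous assumptions
$\mathrm{var}(\varepsilon_i) = \sigma^2$ and $\mathrm{cov}(\varepsilon_i, \varepsilon_j) = 0$.

The matrix $\mathbf{X}$ in (7.4) is $n \times (k + 1)$. In this chapter we assume that $n > k + 1$ and
$\mathrm{rank}(\mathbf{X}) = k + 1$. If $n < k + 1$ or if there is a linear relationship among the $x$'s, for
example, $x_5 = \sum_{j=1}^{4} x_j/4$, then $\mathbf{X}$ will not have full column rank. If the values of the
$x_{ij}$'s are planned (chosen by the researcher), then the $\mathbf{X}$ matrix essentially contains the
experimental design and is sometimes called the design matrix.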

The $\beta$ parameters in (7.1) or (7.4) are called regression coefficients. To emphasize
their collective effect, they are sometimes referred to as partial regression coefficients.
The word partial carries both a mathematical and a statistical meaning.
Mathematically, the partial derivative of $E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$
with respect to $x_1$, for example, is $\beta_1$. Thus $\beta_1$ indicates the change in $E(y)$ with a
unit increase in $x_1$ when $x_2, x_3, \ldots, x_k$ are held constant. Statistically, $\beta_1$ shows the
effect of $x_1$ on $E(y)$ in the presence of the other $x$'s. This effect would typically be
different from the effect of $x_1$ on $E(y)$ if the other $x$'s were not present in the
model. Thus, for example, $\beta_0$ and $\beta_1$ in

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$$

will usually be different from $\beta_0^*$ and $\beta_1^*$ in

$$y = \beta_0^* + \beta_1^* x_1 + \varepsilon^*.$$

[If $\mathbf{x}_1$ and $\mathbf{x}_2$ are orthogonal, that is, if $\mathbf{x}_1'\mathbf{x}_2 = 0$ or if $(\mathbf{x}_1 - \bar{x}_1\mathbf{j})'(\mathbf{x}_2 - \bar{x}_2\mathbf{j}) = 0$, where
$\mathbf{x}_1$ and $\mathbf{x}_2$ are columns in the $\mathbf{X}$ matrix, then $\hat{\beta}_0 = \hat{\beta}_0^*$ and $\hat{\beta}_1 = \hat{\beta}_1^*$; see Corollary 1 to
Theorem 7.9a and Theorem 7.10.] The change in parameters when an $x$ is deleted
from the model is illustrated (with estimates) in the following example.


TABLE 7.1 Data for Example 7.2

Observation
Number        y     x1    x2
 1            2     0     2
 2            3     2     6
 3            2     2     7
 4            7     2     5
 5            6     4     9
 6            8     4     8
 7           10     4     7
 8            7     6    10
 9            8     6    11
10           12     6     9
11           11     8    15
12           14     8    13


Example 7.2. [See Freund and Minton (1979, pp. 36-39)]. Consider the (contrived)
data in Table 7.1.

Using (6.5) and (6.6) from Section 6.2 and (7.6) in Section 7.3 (see Example
7.3.1), we obtain prediction equations for $y$ regressed on $x_1$ alone, on $x_2$ alone, and
on both $x_1$ and $x_2$:

$$\hat{y} = 1.86 + 1.30x_1,$$

$$\hat{y} = .86 + .78x_2,$$

$$\hat{y} = 5.37 + 3.01x_1 - 1.29x_2.$$


Figure 7.1 Regression of y on x2 ignoring x1.

Figure 7.2 Regression of y on x2 showing the value of x1 at each point and partial regressions
of y on x2 (for x1 = 2, 4, 6, and 8).


As expected, the coefficients change from either of the reduced models to the full
model. Note the sign change as the coefficient of $x_2$ changes from .78 to $-1.29$.

The values of $y$ and $x_2$ are plotted in Figure 7.1 along with the prediction equation
$\hat{y} = .86 + .78x_2$. The linear trend is clearly evident.

In Figure 7.2 we have the same plot as in Figure 7.1, except that each point
is labeled with the value of $x_1$. Examining values of $y$ and $x_2$ for a fixed value of
$x_1$ (2, 4, 6, or 8) shows a negative slope for the relationship. These negative relationships
are shown as partial regressions of $y$ on $x_2$ for each value of $x_1$. The partial
regression coefficient $\hat{\beta}_2 = -1.29$ reflects the negative slopes of these four partial
regressions.

Further insight into the meaning of the partial regression coefficients is given in
Section 7.10.
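The sign change in Example 7.2 can be reproduced directly from Table 7.1. The sketch below is illustrative and not from the original text: `simple_fit` is a hypothetical helper implementing the simple-regression formulas of Section 6.2, and the within-group slopes are separate straight-line fits of y on x2 for each fixed value of x1. They are a rough stand-in for the parallel partial regression lines of Figure 7.2, not the same thing.

```python
# Data from Table 7.1
y  = [2, 3, 2, 7, 6, 8, 10, 7, 8, 12, 11, 14]
x1 = [0, 2, 2, 2, 4, 4, 4, 6, 6, 6, 8, 8]
x2 = [2, 6, 7, 5, 9, 8, 7, 10, 11, 9, 15, 13]

def simple_fit(x, y):
    """Least-squares intercept and slope for y regressed on a single x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

fit_x1 = simple_fit(x1, y)   # y on x1 alone
fit_x2 = simple_fit(x2, y)   # y on x2 alone (positive slope)

# Slope of y on x2 within each level of x1 that has at least two points
within = {}
for level in sorted(set(x1)):
    xs = [b for a, b in zip(x1, x2) if a == level]
    ys = [c for a, c in zip(x1, y) if a == level]
    if len(xs) >= 2:
        within[level] = simple_fit(xs, ys)[1]

print(fit_x1, fit_x2)
print(within)
```

Ignoring x1, the slope on x2 is positive, while every within-group slope is negative, which is the pattern the example describes.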


7.3 ESTIMATION OF $\boldsymbol{\beta}$ AND $\sigma^2$


7.3.1 Least-Squares Estimator for $\boldsymbol{\beta}$


In this section, we discuss the least-squares approach to estimation of the $\beta$'s in the
fixed-$x$ model (7.1) or (7.4). No distributional assumptions on $y$ are required to obtain
the estimators.

For the parameters $\beta_0, \beta_1, \ldots, \beta_k$, we seek estimators that minimize the sum of
squares of deviations of the $n$ observed $y$'s from their predicted values $\hat{y}$. By extension
of (6.2), we seek $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$ that minimize

$$\sum_{i=1}^n \hat{\varepsilon}_i^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2
= \sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \hat{\beta}_2 x_{i2} - \cdots - \hat{\beta}_k x_{ik})^2. \tag{7.5}$$


Note that the predicted value $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \cdots + \hat{\beta}_k x_{ik}$ estimates $E(y_i)$, not $y_i$. A
better notation would be $\widehat{E}(y_i)$, but $\hat{y}_i$ is commonly used.

To obtain the least-squares estimators, it is not necessary that the prediction
equation $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \cdots + \hat{\beta}_k x_{ik}$ be based on $E(y_i)$. It is only necessary to
postulate an empirical model that is linear in the $\hat{\beta}$'s, and the least-squares method will
find the "best" fit to this model. This was illustrated in Figure 6.2.

To find the values of $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$ that minimize (7.5), we could differentiate $\sum_i \hat{\varepsilon}_i^2$
with respect to each $\hat{\beta}_j$ and set the results equal to zero to yield $k + 1$ equations that can
be solved simultaneously for the $\hat{\beta}_j$'s. However, the procedure can be carried out in more
compact form with matrix notation. The result is given in the following theorem.


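Theorem 7.3a. If $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$, where $\mathbf{X}$ is $n \times (k + 1)$ of rank $k + 1 < n$, then the
value of $\hat{\boldsymbol{\beta}} = (\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k)'$ that minimizes (7.5) is

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}. \tag{7.6}$$

PROOF. Using (2.20) and (2.27), we can write (7.5) as

$$\hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}} = \sum_{i=1}^n (y_i - \mathbf{x}_i'\hat{\boldsymbol{\beta}})^2 = (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}), \tag{7.7}$$

where $\mathbf{x}_i' = (1, x_{i1}, \ldots, x_{ik})$ is the $i$th row of $\mathbf{X}$. When the product $(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})$
in (7.7) is expanded as in (2.17), two of the resulting four terms can be combined to
yield

$$\hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}} = \mathbf{y}'\mathbf{y} - 2\mathbf{y}'\mathbf{X}\hat{\boldsymbol{\beta}} + \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}}.$$

We can find the value of $\hat{\boldsymbol{\beta}}$ that minimizes $\hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}}$ by differentiating $\hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}}$ with respect to $\hat{\boldsymbol{\beta}}$
[using (2.112) and (2.113)] and setting the result equal to zero:

$$\frac{\partial \hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}}}{\partial \hat{\boldsymbol{\beta}}} = \mathbf{0} - 2\mathbf{X}'\mathbf{y} + 2\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{0}.$$

This gives the normal equations

$$\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}'\mathbf{y}. \tag{7.8}$$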


By Theorems 2.4(iii) and 2.6d(i) and Corollary 1 of Theorem 2.6c, if $\mathbf{X}$ is full-rank,
$\mathbf{X}'\mathbf{X}$ is nonsingular, and the solution to (7.8) is given by (7.6).


Since $\hat{\boldsymbol{\beta}}$ in (7.6) minimizes the sum of squares in (7.5), $\hat{\boldsymbol{\beta}}$ is called the least-squares
estimator. Note that each $\hat{\beta}_j$ in $\hat{\boldsymbol{\beta}}$ is a linear function of $\mathbf{y}$; that is,
$\hat{\beta}_j = \mathbf{a}_j'\mathbf{y}$, where $\mathbf{a}_j'$ is the $j$th row of $(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$. This usage of the word linear in
linear estimator is different from that in linear model, which indicates that the
model is linear in the $\beta$'s.
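We now show that $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$ minimizes $\hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}}$. Let $\mathbf{b}$ be an alternative
estimator that may do better than $\hat{\boldsymbol{\beta}}$ so that $\hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}}$ is

$$\hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}} = (\mathbf{y} - \mathbf{X}\mathbf{b})'(\mathbf{y} - \mathbf{X}\mathbf{b}).$$

Now adding and subtracting $\mathbf{X}\hat{\boldsymbol{\beta}}$, we obtain

$$\begin{aligned}
\hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}} &= (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}} + \mathbf{X}\hat{\boldsymbol{\beta}} - \mathbf{X}\mathbf{b})'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}} + \mathbf{X}\hat{\boldsymbol{\beta}} - \mathbf{X}\mathbf{b}) && (7.9) \\
&= (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) + (\hat{\boldsymbol{\beta}} - \mathbf{b})'\mathbf{X}'\mathbf{X}(\hat{\boldsymbol{\beta}} - \mathbf{b}) \\
&\quad + 2(\hat{\boldsymbol{\beta}} - \mathbf{b})'(\mathbf{X}'\mathbf{y} - \mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}}). && (7.10)
\end{aligned}$$

The third term on the right side of (7.10) vanishes because of the normal equations
$\mathbf{X}'\mathbf{y} = \mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}}$ in (7.8). The second term is a positive definite quadratic form (assuming
that $\mathbf{X}$ is full-rank; see Theorem 2.6d), and $\hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}}$ is therefore minimized when $\mathbf{b} = \hat{\boldsymbol{\beta}}$.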

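To examine the structure of $\mathbf{X}'\mathbf{X}$ and $\mathbf{X}'\mathbf{y}$, note that by Theorem 2.2c(i), the
$(k + 1) \times (k + 1)$ matrix $\mathbf{X}'\mathbf{X}$ can be obtained as products of columns of $\mathbf{X}$; similarly,
$\mathbf{X}'\mathbf{y}$ contains products of columns of $\mathbf{X}$ and $\mathbf{y}$:

$$\mathbf{X}'\mathbf{X} = \begin{pmatrix}
n & \sum_i x_{i1} & \sum_i x_{i2} & \cdots & \sum_i x_{ik} \\
\sum_i x_{i1} & \sum_i x_{i1}^2 & \sum_i x_{i1}x_{i2} & \cdots & \sum_i x_{i1}x_{ik} \\
\vdots & \vdots & \vdots & & \vdots \\
\sum_i x_{ik} & \sum_i x_{i1}x_{ik} & \sum_i x_{i2}x_{ik} & \cdots & \sum_i x_{ik}^2
\end{pmatrix}, \qquad
\mathbf{X}'\mathbf{y} = \begin{pmatrix}
\sum_i y_i \\ \sum_i x_{i1}y_i \\ \vdots \\ \sum_i x_{ik}y_i
\end{pmatrix}.$$

If $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$ as in (7.6), then

$$\hat{\boldsymbol{\varepsilon}} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{y} - \hat{\mathbf{y}} \tag{7.11}$$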


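is the vector of residuals, $\hat{\varepsilon}_1 = y_1 - \hat{y}_1, \hat{\varepsilon}_2 = y_2 - \hat{y}_2, \ldots, \hat{\varepsilon}_n = y_n - \hat{y}_n$. The residual
vector $\hat{\boldsymbol{\varepsilon}}$ estimates $\boldsymbol{\varepsilon}$ in the model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ and can be used to check the validity
of the model and attendant assumptions; see Chapter 9.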


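Example 7.3.1a. We use the data in Table 7.1 to illustrate computation of $\hat{\boldsymbol{\beta}}$ using (7.6).

$$\mathbf{y} = \begin{pmatrix} 2 \\ 3 \\ 2 \\ 7 \\ 6 \\ 8 \\ 10 \\ 7 \\ 8 \\ 12 \\ 11 \\ 14 \end{pmatrix}, \quad
\mathbf{X} = \begin{pmatrix}
1 & 0 & 2 \\ 1 & 2 & 6 \\ 1 & 2 & 7 \\ 1 & 2 & 5 \\ 1 & 4 & 9 \\ 1 & 4 & 8 \\
1 & 4 & 7 \\ 1 & 6 & 10 \\ 1 & 6 & 11 \\ 1 & 6 & 9 \\ 1 & 8 & 15 \\ 1 & 8 & 13
\end{pmatrix}, \quad
\mathbf{X}'\mathbf{X} = \begin{pmatrix} 12 & 52 & 102 \\ 52 & 296 & 536 \\ 102 & 536 & 1004 \end{pmatrix},$$

$$\mathbf{X}'\mathbf{y} = \begin{pmatrix} 90 \\ 482 \\ 872 \end{pmatrix}, \quad
(\mathbf{X}'\mathbf{X})^{-1} = \begin{pmatrix} .97476 & .24290 & -.22871 \\ .24290 & .16207 & -.11120 \\ -.22871 & -.11120 & .08360 \end{pmatrix},$$

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} = \begin{pmatrix} 5.3754 \\ 3.0118 \\ -1.2855 \end{pmatrix}.$$

(The middle diagonal element of $\mathbf{X}'\mathbf{X}$ is $\sum_i x_{i1}^2 = 296$ for the $x_1$ values in Table 7.1; this is the value consistent with the inverse and with $\hat{\boldsymbol{\beta}}$ shown here.)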
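The computation in this example can be reproduced by forming the normal equations (7.8) from Table 7.1 and solving them. The following is a minimal sketch in plain Python, not the authors' code: `solve` is a hypothetical helper that performs Gaussian elimination with partial pivoting, used here in place of an explicit matrix inverse.

```python
def solve(a, b):
    """Solve the square system a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(map(float, row)) + [float(bi)] for row, bi in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):      # back substitution
        z[r] = (m[r][n] - sum(m[r][c] * z[c] for c in range(r + 1, n))) / m[r][r]
    return z

# Data from Table 7.1
y  = [2, 3, 2, 7, 6, 8, 10, 7, 8, 12, 11, 14]
x1 = [0, 2, 2, 2, 4, 4, 4, 6, 6, 6, 8, 8]
x2 = [2, 6, 7, 5, 9, 8, 7, 10, 11, 9, 15, 13]

X = [[1, u, v] for u, v in zip(x1, x2)]   # rows of the X matrix in (7.4)
p = len(X[0])

# X'X and X'y built as sums of products of columns
XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]

beta_hat = solve(XtX, Xty)                # solution of the normal equations (7.8)
print(XtX)
print(Xty)
print([round(b, 4) for b in beta_hat])
```

Solving the linear system directly avoids forming $(\mathbf{X}'\mathbf{X})^{-1}$, which is usually preferable numerically; the resulting coefficients agree with the values shown in the example to the printed precision.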


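Example 7.3.1b. Simple linear regression from Chapter 6 can also be expressed in
matrix terms:

$$\mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad
\mathbf{X} = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}, \quad
\boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix},$$

$$\mathbf{X}'\mathbf{X} = \begin{pmatrix} n & \sum_i x_i \\ \sum_i x_i & \sum_i x_i^2 \end{pmatrix}, \quad
\mathbf{X}'\mathbf{y} = \begin{pmatrix} \sum_i y_i \\ \sum_i x_i y_i \end{pmatrix},$$

$$(\mathbf{X}'\mathbf{X})^{-1} = \frac{1}{n\sum_i x_i^2 - \left(\sum_i x_i\right)^2}
\begin{pmatrix} \sum_i x_i^2 & -\sum_i x_i \\ -\sum_i x_i & n \end{pmatrix}.$$

Then $\hat{\beta}_0$ and $\hat{\beta}_1$ can be obtained using (7.6), $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$:

$$\hat{\boldsymbol{\beta}} = \begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{pmatrix}
= \frac{1}{n\sum_i x_i^2 - \left(\sum_i x_i\right)^2}
\begin{pmatrix} \left(\sum_i x_i^2\right)\left(\sum_i y_i\right) - \left(\sum_i x_i\right)\left(\sum_i x_i y_i\right) \\
-\left(\sum_i x_i\right)\left(\sum_i y_i\right) + n\sum_i x_i y_i \end{pmatrix}. \tag{7.12}$$

The estimators $\hat{\beta}_0$ and $\hat{\beta}_1$ in (7.12) are the same as those in (6.5) and (6.6).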


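7.3.2 Properties of the Least-Squares Estimator $\hat{\boldsymbol{\beta}}$


The least-squares estimator $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$ in Theorem 7.3a was obtained without
using the assumptions $E(\mathbf{y}) = \mathbf{X}\boldsymbol{\beta}$ and $\mathrm{cov}(\mathbf{y}) = \sigma^2\mathbf{I}$ given in Section 7.2. We merely
postulated a model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ as in (7.4) and fitted it. If $E(\mathbf{y}) \neq \mathbf{X}\boldsymbol{\beta}$, the model
$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ could still be fitted to the data, in which case $\hat{\boldsymbol{\beta}}$ may have poor properties.
If $\mathrm{cov}(\mathbf{y}) \neq \sigma^2\mathbf{I}$, there may be additional adverse effects on the estimator $\hat{\boldsymbol{\beta}}$.
However, if $E(\mathbf{y}) = \mathbf{X}\boldsymbol{\beta}$ and $\mathrm{cov}(\mathbf{y}) = \sigma^2\mathbf{I}$ hold, $\hat{\boldsymbol{\beta}}$ has some good properties, as
noted in the four theorems in this section. Note that $\hat{\boldsymbol{\beta}}$ is a random vector (from
sample to sample). We discuss its mean vector and covariance matrix in this
section (with no distributional assumptions on $\mathbf{y}$) and its distribution (assuming
that the $y$ variables are normal) in Section 7.6.3. In the following theorems, we
assume that $\mathbf{X}$ is fixed (remains constant in repeated sampling) and full rank.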


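Theorem 7.3b. If $E(\mathbf{y}) = \mathbf{X}\boldsymbol{\beta}$, then $\hat{\boldsymbol{\beta}}$ is an unbiased estimator for $\boldsymbol{\beta}$.

PROOF.

$$\begin{aligned}
E(\hat{\boldsymbol{\beta}}) &= E[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}] \\
&= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'E(\mathbf{y}) \qquad [\text{by (3.38)}] \\
&= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}\boldsymbol{\beta} \\
&= \boldsymbol{\beta}. \qquad (7.13)
\end{aligned}$$


Theorem 7.3c. If $\mathrm{cov}(\mathbf{y}) = \sigma^2\mathbf{I}$, the covariance matrix for $\hat{\boldsymbol{\beta}}$ is given by $\sigma^2(\mathbf{X}'\mathbf{X})^{-1}$.

PROOF.

$$\begin{aligned}
\mathrm{cov}(\hat{\boldsymbol{\beta}}) &= \mathrm{cov}[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}] \\
&= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\,\mathrm{cov}(\mathbf{y})\,[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']' \qquad [\text{by (3.44)}] \\
&= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'(\sigma^2\mathbf{I})\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} \\
&= \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} \\
&= \sigma^2(\mathbf{X}'\mathbf{X})^{-1}. \qquad (7.14)
\end{aligned}$$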


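Example 7.3.2a. Using the matrix $(\mathbf{X}'\mathbf{X})^{-1}$ for simple linear regression given in
Example 7.3.1, we obtain

$$\begin{aligned}
\mathrm{cov}(\hat{\boldsymbol{\beta}}) = \mathrm{cov}\begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{pmatrix}
&= \begin{pmatrix} \mathrm{var}(\hat{\beta}_0) & \mathrm{cov}(\hat{\beta}_0, \hat{\beta}_1) \\ \mathrm{cov}(\hat{\beta}_0, \hat{\beta}_1) & \mathrm{var}(\hat{\beta}_1) \end{pmatrix}
= \sigma^2(\mathbf{X}'\mathbf{X})^{-1} \\
&= \frac{\sigma^2}{n\sum_i x_i^2 - \left(\sum_i x_i\right)^2}
\begin{pmatrix} \sum_i x_i^2 & -\sum_i x_i \\ -\sum_i x_i & n \end{pmatrix} \\
&= \frac{\sigma^2}{\sum_i (x_i - \bar{x})^2}
\begin{pmatrix} \sum_i x_i^2/n & -\bar{x} \\ -\bar{x} & 1 \end{pmatrix}. \qquad (7.16)
\end{aligned}$$

Thus

$$\mathrm{var}(\hat{\beta}_0) = \frac{\sigma^2\sum_i x_i^2}{n\sum_i (x_i - \bar{x})^2}, \qquad
\mathrm{var}(\hat{\beta}_1) = \frac{\sigma^2}{\sum_i (x_i - \bar{x})^2}, \qquad
\mathrm{cov}(\hat{\beta}_0, \hat{\beta}_1) = \frac{-\sigma^2\bar{x}}{\sum_i (x_i - \bar{x})^2}.$$

We found $\mathrm{var}(\hat{\beta}_0)$ and $\mathrm{var}(\hat{\beta}_1)$ in Section 6.2 but did not obtain $\mathrm{cov}(\hat{\beta}_0, \hat{\beta}_1)$. Note that if
$\bar{x} > 0$, then $\mathrm{cov}(\hat{\beta}_0, \hat{\beta}_1)$ is negative and the estimated slope and intercept are negatively
correlated. In this case, if the estimate of the slope increases from one sample to another,
the estimate of the intercept tends to decrease (assuming the $x$'s stay the same).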


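Example 7.3.2b. For the data in Table 7.1, $(\mathbf{X}'\mathbf{X})^{-1}$ is as given in Example 7.3.1.
Thus, $\mathrm{cov}(\hat{\boldsymbol{\beta}})$ is given by

$$\mathrm{cov}(\hat{\boldsymbol{\beta}}) = \sigma^2(\mathbf{X}'\mathbf{X})^{-1} = \sigma^2
\begin{pmatrix} .975 & .243 & -.229 \\ .243 & .162 & -.111 \\ -.229 & -.111 & .084 \end{pmatrix}.$$

The negative value of $\mathrm{cov}(\hat{\beta}_1, \hat{\beta}_2) = -.111\sigma^2$ indicates that in repeated sampling
(using the same 12 values of $x_1$ and $x_2$), $\hat{\beta}_1$ and $\hat{\beta}_2$ would tend to move in opposite
directions; that is, an increase in one would be accompanied by a decrease in the
other.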


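In addition to $E(\hat{\boldsymbol{\beta}}) = \boldsymbol{\beta}$ and $\mathrm{cov}(\hat{\boldsymbol{\beta}}) = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}$, a third important property of $\hat{\boldsymbol{\beta}}$
is that under the standard assumptions, the variance of each $\hat{\beta}_j$ is minimum (see the
following theorem).


Theorem 7.3d (Gauss-Markov Theorem). If $E(\mathbf{y}) = \mathbf{X}\boldsymbol{\beta}$ and $\mathrm{cov}(\mathbf{y}) = \sigma^2\mathbf{I}$, the
least-squares estimators $\hat{\beta}_j$, $j = 0, 1, \ldots, k$, have minimum variance among all
linear unbiased estimators.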


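PROOF. We consider a linear estimator $\mathbf{A}\mathbf{y}$ of $\boldsymbol{\beta}$ and seek the matrix $\mathbf{A}$ for which $\mathbf{A}\mathbf{y}$ is
a minimum variance unbiased estimator of $\boldsymbol{\beta}$. In order for $\mathbf{A}\mathbf{y}$ to be an unbiased
estimator of $\boldsymbol{\beta}$, we must have $E(\mathbf{A}\mathbf{y}) = \boldsymbol{\beta}$. Using the assumption $E(\mathbf{y}) = \mathbf{X}\boldsymbol{\beta}$, this can be
expressed as

$$E(\mathbf{A}\mathbf{y}) = \mathbf{A}E(\mathbf{y}) = \mathbf{A}\mathbf{X}\boldsymbol{\beta} = \boldsymbol{\beta},$$

which gives the unbiasedness condition

$$\mathbf{A}\mathbf{X} = \mathbf{I}$$

since the relationship $\mathbf{A}\mathbf{X}\boldsymbol{\beta} = \boldsymbol{\beta}$ must hold for any possible value of $\boldsymbol{\beta}$ [see (2.44)].
The covariance matrix for the estimator $\mathbf{A}\mathbf{y}$ is given by

$$\mathrm{cov}(\mathbf{A}\mathbf{y}) = \mathbf{A}(\sigma^2\mathbf{I})\mathbf{A}' = \sigma^2\mathbf{A}\mathbf{A}'.$$

The variances of the $\hat{\beta}_j$'s are on the diagonal of $\sigma^2\mathbf{A}\mathbf{A}'$, and we therefore need to
choose $\mathbf{A}$ (subject to $\mathbf{A}\mathbf{X} = \mathbf{I}$) so that the diagonal elements of $\mathbf{A}\mathbf{A}'$ are minimized.

To relate $\mathbf{A}\mathbf{y}$ to $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$, we add and subtract $(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$ to obtain

$$\mathbf{A}\mathbf{A}' = [\mathbf{A} - (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'][\mathbf{A} - (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']'.$$

Expanding this in terms of $\mathbf{A} - (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$ and $(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$, we obtain four terms, two
of which vanish because of the restriction $\mathbf{A}\mathbf{X} = \mathbf{I}$. The result is

$$\mathbf{A}\mathbf{A}' = [\mathbf{A} - (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'][\mathbf{A} - (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']' + (\mathbf{X}'\mathbf{X})^{-1}. \tag{7.17}$$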
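The step leading to (7.17) can be spelled out. The following worked equations are a supplement in the notation of the proof (they are not part of the original text) and use only the restriction $\mathbf{A}\mathbf{X} = \mathbf{I}$ and the symmetry of $(\mathbf{X}'\mathbf{X})^{-1}$. The first line shows that one cross term vanishes (the other is its transpose), and the second gives the surviving term.

```latex
\begin{aligned}
[\mathbf{A} - (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']\,[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']'
  &= \mathbf{A}\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}
     - (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}
   = (\mathbf{X}'\mathbf{X})^{-1} - (\mathbf{X}'\mathbf{X})^{-1} = \mathbf{O}, \\
[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']\,[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']'
  &= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}
   = (\mathbf{X}'\mathbf{X})^{-1}.
\end{aligned}
```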


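The matrix $[\mathbf{A} - (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'][\mathbf{A} - (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']'$ on the right side of (7.17) is positive
semidefinite (see Theorem 2.6d), and, by Theorem 2.6a(ii), the diagonal elements are
greater than or equal to zero. These diagonal elements can be made equal to zero by
choosing $\mathbf{A} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$. (This value of $\mathbf{A}$ also satisfies the unbiasedness condition
$\mathbf{A}\mathbf{X} = \mathbf{I}$.) The resulting minimum variance estimator of $\boldsymbol{\beta}$ is

$$\mathbf{A}\mathbf{y} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y},$$

which is equal to the least-squares estimator $\hat{\boldsymbol{\beta}}$.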


The Gauss-Markov theorem is sometimes stated as follows. If $E(\mathbf{y}) = \mathbf{X}\boldsymbol{\beta}$ and
$\mathrm{cov}(\mathbf{y}) = \sigma^2\mathbf{I}$, the least-squares estimators $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$ are best linear unbiased
estimators (BLUE). In this expression, best means minimum variance and linear
indicates that the estimators are linear functions of $\mathbf{y}$.

The remarkable feature of the Gauss-Markov theorem is its distributional generality.
The result holds for any distribution of $\mathbf{y}$; normality is not required. The only
assumptions used in the proof are $E(\mathbf{y}) = \mathbf{X}\boldsymbol{\beta}$ and $\mathrm{cov}(\mathbf{y}) = \sigma^2\mathbf{I}$. If these assumptions
do not hold, $\hat{\boldsymbol{\beta}}$ may be biased or each $\hat{\beta}_j$ may have a larger variance than that of some
other estimator.


The Gauss-Markov theorem is easily extended to a linear combination of the $\hat{\beta}$'s,
as follows.


Corollary 1. If $E(\mathbf{y}) = \mathbf{X}\boldsymbol{\beta}$ and $\mathrm{cov}(\mathbf{y}) = \sigma^2\mathbf{I}$, the best linear unbiased estimator of
$\mathbf{a}'\boldsymbol{\beta}$ is $\mathbf{a}'\hat{\boldsymbol{\beta}}$, where $\hat{\boldsymbol{\beta}}$ is the least-squares estimator $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$.


PROOF. See Problem 7.7.


Note that Theorem 7.3d is concerned with the form of the estimator $\hat{\boldsymbol{\beta}}$ for a given $\mathbf{X}$
matrix. Once $\mathbf{X}$ is chosen, the variances of the $\hat{\beta}_j$'s are minimized by $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$.
However, in Theorem 7.3c, we have $\mathrm{cov}(\hat{\boldsymbol{\beta}}) = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}$ and therefore $\mathrm{var}(\hat{\beta}_j)$ and
$\mathrm{cov}(\hat{\beta}_i, \hat{\beta}_j)$ depend on the values of the $x_j$'s. Thus the configuration of $\mathbf{X}'\mathbf{X}$ is important
in estimation of the $\beta_j$'s (this was illustrated in Problem 6.4).

In both estimation and testing, there are advantages to choosing the $x$'s (or the
centered $x$'s) to be orthogonal so that $\mathbf{X}'\mathbf{X}$ is diagonal. These advantages include
minimizing the variances of the $\hat{\beta}_j$'s and maximizing the power of tests about the $\beta_j$'s
(Chapter 8). For clarification, we note that orthogonality is necessary but not sufficient
for minimizing variances and maximizing power. For example, if there are two $x$'s,
with values to be selected in a rectangular space, the points could be evenly placed
on a grid, which would be an orthogonal pattern. However, the optimal orthogonal
pattern would be to place one-fourth of the points at each corner of the rectangle.

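A fourth property of $\hat{\boldsymbol{\beta}}$ is as follows. The predicted value
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_k x_k = \hat{\boldsymbol{\beta}}'\mathbf{x}$ is invariant to simple linear changes of scale on the $x$'s,
where $\mathbf{x} = (1, x_1, x_2, \ldots, x_k)'$. Let the rescaled variables be denoted by $z_j = c_j x_j$,
$j = 1, 2, \ldots, k$, where the $c_j$ terms are constants. Thus $\mathbf{x}$ is transformed to
$\mathbf{z} = (1, c_1 x_1, \ldots, c_k x_k)'$. The following theorem shows that $\hat{y}$ based on $\mathbf{z}$ is the same
as $\hat{y}$ based on $\mathbf{x}$.


Theorem 7.3e. If $\mathbf{x} = (1, x_1, \ldots, x_k)'$ and $\mathbf{z} = (1, c_1 x_1, \ldots, c_k x_k)'$, then
$\hat{y} = \hat{\boldsymbol{\beta}}'\mathbf{x} = \hat{\boldsymbol{\beta}}_z'\mathbf{z}$, where $\hat{\boldsymbol{\beta}}_z$ is the least-squares estimator from the regression of $y$ on $\mathbf{z}$.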


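PROOF. From (2.29), we can rewrite $\mathbf{z}$ as $\mathbf{z} = \mathbf{D}\mathbf{x}$, where $\mathbf{D} = \mathrm{diag}(1, c_1, c_2, \ldots, c_k)$.
Then, the $\mathbf{X}$ matrix is transformed to $\mathbf{Z} = \mathbf{X}\mathbf{D}$ [see (2.28)]. We substitute $\mathbf{Z} = \mathbf{X}\mathbf{D}$ in
the least-squares estimator $\hat{\boldsymbol{\beta}}_z = (\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{y}$ to obtain

$$\begin{aligned}
\hat{\boldsymbol{\beta}}_z = (\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{y} &= [(\mathbf{X}\mathbf{D})'(\mathbf{X}\mathbf{D})]^{-1}(\mathbf{X}\mathbf{D})'\mathbf{y} \\
&= \mathbf{D}^{-1}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} \qquad [\text{by (2.49)}] \\
&= \mathbf{D}^{-1}\hat{\boldsymbol{\beta}}, \qquad (7.18)
\end{aligned}$$

where $\hat{\boldsymbol{\beta}}$ is the usual estimator for $y$ regressed on the $x$'s. Then

$$\hat{\boldsymbol{\beta}}_z'\mathbf{z} = (\mathbf{D}^{-1}\hat{\boldsymbol{\beta}})'\mathbf{D}\mathbf{x} = \hat{\boldsymbol{\beta}}'\mathbf{x}.$$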
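Theorem 7.3e can be checked numerically on the data of Table 7.1. The sketch below is illustrative only: the scale factors `c1 = 10` and `c2 = 0.5` are hypothetical, and `solve` and `least_squares` are helper functions written for this example (Gaussian elimination applied to the normal equations), not routines from the text.

```python
def solve(a, b):
    """Solve the square system a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(map(float, row)) + [float(bi)] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (m[r][n] - sum(m[r][c] * z[c] for c in range(r + 1, n))) / m[r][r]
    return z

def least_squares(rows, y):
    """Return the solution of the normal equations X'X b = X'y."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)

# Data from Table 7.1
y  = [2, 3, 2, 7, 6, 8, 10, 7, 8, 12, 11, 14]
x1 = [0, 2, 2, 2, 4, 4, 4, 6, 6, 6, 8, 8]
x2 = [2, 6, 7, 5, 9, 8, 7, 10, 11, 9, 15, 13]

c1, c2 = 10.0, 0.5   # hypothetical scale factors, so D = diag(1, c1, c2)

X = [[1.0, u, v] for u, v in zip(x1, x2)]
Z = [[1.0, c1 * u, c2 * v] for u, v in zip(x1, x2)]   # Z = XD

beta_x = least_squares(X, y)
beta_z = least_squares(Z, y)

fitted_x = [sum(b * v for b, v in zip(beta_x, row)) for row in X]
fitted_z = [sum(b * v for b, v in zip(beta_z, row)) for row in Z]
max_gap = max(abs(p - q) for p, q in zip(fitted_x, fitted_z))

print(beta_x, beta_z, max_gap)
```

The coefficients change by the factors $1/c_j$, as (7.18) states, while the fitted values agree up to rounding error.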


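In the following corollary to Theorem 7.3e, the invariance of $\hat{y}$ is extended to any
full-rank linear transformation of the $x$ variables.


Corollary 1. The predicted value $\hat{y}$ is invariant to a full-rank linear transformation on
the $x$'s.

PROOF. We can express a full-rank linear transformation of the $x$'s as

$$\mathbf{Z} = \mathbf{X}\mathbf{K} = (\mathbf{j}, \mathbf{X}_1)\begin{pmatrix} 1 & \mathbf{0}' \\ \mathbf{0} & \mathbf{K}_1 \end{pmatrix}
= (\mathbf{j} + \mathbf{X}_1\mathbf{0},\; \mathbf{j}\mathbf{0}' + \mathbf{X}_1\mathbf{K}_1) = (\mathbf{j}, \mathbf{X}_1\mathbf{K}_1),$$

where $\mathbf{K}_1$ is nonsingular and

$$\mathbf{X}_1 = \begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1k} \\
x_{21} & x_{22} & \cdots & x_{2k} \\
\vdots & \vdots & & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{nk}
\end{pmatrix}. \tag{7.19}$$

We partition $\mathbf{X}$ and $\mathbf{K}$ in this way so as to transform only the $x$'s in $\mathbf{X}_1$, leaving the first
column of $\mathbf{X}$ unaffected. Now $\hat{\boldsymbol{\beta}}_z$ becomes

$$\hat{\boldsymbol{\beta}}_z = (\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{y} = \mathbf{K}^{-1}\hat{\boldsymbol{\beta}}, \tag{7.20}$$

and we have

$$\hat{y} = \hat{\boldsymbol{\beta}}_z'\mathbf{z} = \hat{\boldsymbol{\beta}}'\mathbf{x}, \tag{7.21}$$

where $\mathbf{z} = \mathbf{K}'\mathbf{x}$.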


In addition to $\hat{y}$, the sample variance $s^2$ (Section 7.3.3) is also invariant to changes
of scale on the $x$ variables (see Problem 7.10). The following are invariant to changes
of scale on $y$ as well as on the $x$'s (but not to a joint linear transformation on $y$ and the
$x$'s): $t$ statistics (Section 8.5), $F$ statistics (Chapter 8), and $R^2$ (Sections 7.7 and 10.3).


7.3.3 An Estimator for $\sigma^2$


The method of least squares does not yield a function of the $y$ and $x$ values in the
sample that we can minimize to obtain an estimator of $\sigma^2$. However, we can devise
an unbiased estimator for $\sigma^2$ based on the least-squares estimator $\hat{\boldsymbol{\beta}}$. By assumption
2 following (7.3), $\sigma^2$ is the same for each $y_i$, $i = 1, 2, \ldots, n$. By (3.6), $\sigma^2$ is defined by
$\sigma^2 = E[y_i - E(y_i)]^2$, and by assumption 1, we obtain

$$E(y_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} = \mathbf{x}_i'\boldsymbol{\beta},$$

where $\mathbf{x}_i'$ is the $i$th row of $\mathbf{X}$. Thus $\sigma^2$ becomes

$$\sigma^2 = E[y_i - \mathbf{x}_i'\boldsymbol{\beta}]^2.$$


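We estimate $\sigma^2$ by a corresponding average from the sample

$$s^2 = \frac{1}{n-k-1}\sum_{i=1}^n (y_i - \mathbf{x}_i'\hat{\boldsymbol{\beta}})^2, \tag{7.22}$$

where $n$ is the sample size and $k$ is the number of $x$'s. Note that, by the corollary to
Theorem 7.3d, $\mathbf{x}_i'\hat{\boldsymbol{\beta}}$ is the BLUE of $\mathbf{x}_i'\boldsymbol{\beta}$.

Using (7.7), we can write (7.22) as

$$s^2 = \frac{1}{n-k-1}(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) \tag{7.23}$$

$$\phantom{s^2} = \frac{\mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y}}{n-k-1} = \frac{\mathrm{SSE}}{n-k-1}, \tag{7.24}$$

where $\mathrm{SSE} = (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = \mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y}$. With the denominator
$n-k-1$, $s^2$ is an unbiased estimator of $\sigma^2$, as shown below.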


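Theorem 7.3f. If $s^2$ is defined by (7.22), (7.23), or (7.24) and if $E(\mathbf{y}) = \mathbf{X}\boldsymbol{\beta}$ and
$\mathrm{cov}(\mathbf{y}) = \sigma^2\mathbf{I}$, then

$$E(s^2) = \sigma^2. \tag{7.25}$$

PROOF. Using (7.24) and (7.6), we write SSE as a quadratic form:

$$\mathrm{SSE} = \mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} = \mathbf{y}'\mathbf{y} - \mathbf{y}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}
= \mathbf{y}'[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']\mathbf{y}. \tag{7.26}$$

By Theorem 5.2a, we have

$$\begin{aligned}
E(\mathrm{SSE}) &= \mathrm{tr}\{[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']\sigma^2\mathbf{I}\} + E(\mathbf{y}')[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']E(\mathbf{y}) \\
&= \sigma^2\,\mathrm{tr}[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'] + \boldsymbol{\beta}'\mathbf{X}'[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']\mathbf{X}\boldsymbol{\beta} \\
&= \sigma^2\{n - \mathrm{tr}[\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']\} + \boldsymbol{\beta}'\mathbf{X}'\mathbf{X}\boldsymbol{\beta} - \boldsymbol{\beta}'\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}\boldsymbol{\beta} \\
&= \sigma^2\{n - \mathrm{tr}[\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}]\} + \boldsymbol{\beta}'\mathbf{X}'\mathbf{X}\boldsymbol{\beta} - \boldsymbol{\beta}'\mathbf{X}'\mathbf{X}\boldsymbol{\beta} \qquad [\text{by (2.87)}].
\end{aligned}$$

Since $\mathbf{X}'\mathbf{X}$ is $(k + 1) \times (k + 1)$, this becomes

$$E(\mathrm{SSE}) = \sigma^2[n - \mathrm{tr}(\mathbf{I}_{k+1})] = \sigma^2(n - k - 1).$$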


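Corollary 1. An unbiased estimator of $\mathrm{cov}(\hat{\boldsymbol{\beta}})$ in (7.14) is given by

$$\widehat{\mathrm{cov}}(\hat{\boldsymbol{\beta}}) = s^2(\mathbf{X}'\mathbf{X})^{-1}. \tag{7.27}$$

Note the correspondence between $n - (k + 1)$ and $\mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y}$; there are $n$ terms in
$\mathbf{y}'\mathbf{y}$ and $k + 1$ terms in $\hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} = \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}}$ [see (7.8)]. A corresponding property of the
sample is that each additional $x$ (and $\hat{\beta}$) in the model reduces SSE (see Problem 7.13).


Since SSE is a quadratic function of $\mathbf{y}$, it is not a best linear unbiased estimator.
The optimality property of $s^2$ is given in the following theorem.


Theorem 7.3g. If $E(\boldsymbol{\varepsilon}) = \mathbf{0}$, $\mathrm{cov}(\boldsymbol{\varepsilon}) = \sigma^2\mathbf{I}$, and $E(\varepsilon_i^4) = 3\sigma^4$ for the linear model
$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$, then $s^2$ in (7.23) or (7.24) is the best (minimum variance) quadratic
unbiased estimator of $\sigma^2$.


PROOF. See Graybill (1954), Graybill and Wortham (1956), or Wang and Chow
(1994, pp. 161-163).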


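Example 7.3.3. For the data in Table 7.1, we have

\begin{align*}
\text{SSE} &= \mathbf{y}'\mathbf{y}-\hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y}\\
&= 840 - (5.3754,\; 3.0118,\; -1.2855)\begin{pmatrix} 90\\ 482\\ 872 \end{pmatrix}\\
&= 840 - 814.541 = 25.459,
\end{align*}

$$s^2 = \frac{\text{SSE}}{n-k-1} = \frac{25.459}{12-2-1} = 2.829.$$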
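The arithmetic in Example 7.3.3 can be reproduced with a short script. This is a sketch only: Table 7.1 is not reproduced in this section, so the data values below are an assumption; they were chosen because they reproduce the quoted quantities $\mathbf{X}'\mathbf{y} = (90, 482, 872)'$ and $\mathbf{y}'\mathbf{y} = 840$.

```python
import numpy as np

# Values assumed to match Table 7.1 (they reproduce X'y = (90, 482, 872)' and y'y = 840).
y = np.array([2, 3, 2, 7, 6, 8, 10, 7, 8, 12, 11, 14], dtype=float)
x1 = np.array([0, 2, 2, 2, 4, 4, 4, 6, 6, 6, 8, 8], dtype=float)
x2 = np.array([2, 6, 7, 5, 9, 8, 7, 10, 11, 9, 15, 13], dtype=float)

X = np.column_stack([np.ones_like(y), x1, x2])
n, k = X.shape[0], X.shape[1] - 1

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # solves the normal equations (7.8)
resid = y - X @ beta_hat
sse_resid = resid @ resid                      # (y - Xb)'(y - Xb), as in (7.23)
sse_short = y @ y - beta_hat @ (X.T @ y)       # y'y - b'X'y, as in (7.24)
s2 = sse_resid / (n - k - 1)
```

Both forms of SSE agree, and dividing by $n-k-1 = 9$ gives the value of $s^2$ reported in the example (to the rounding used there).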


7.4 GEOMETRY OF LEAST SQUARES 


In Sections 7.1-7.3 we presented the multiple linear regression model as the matrix
equation $\mathbf{y} = \mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon}$ in (7.4). We defined the principle of least-squares estimation in
terms of deviations from the model [see (7.7)], and then used matrix calculus and
matrix algebra to derive the estimators of $\boldsymbol{\beta}$ in (7.6) and of $\sigma^2$ in (7.23) and (7.24).
We now present an alternate but equivalent derivation of these estimators based
completely on geometric ideas.




It is important to clarify first what the geometric approach to least squares is not. In
two dimensions, we illustrated the principle of least squares by creating a
two-dimensional scatter plot (Fig. 6.1) of the $n$ points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$. We
then visualized the least-squares regression line as the best-fitting straight line to
the data. This approach can be generalized to present the least-squares estimate in
multiple linear regression on the basis of the best-fitting hyperplane in
$(k+1)$-dimensional space to the $n$ points $(x_{11}, x_{12}, \ldots, x_{1k}, y_1), (x_{21}, x_{22}, \ldots, x_{2k}, y_2), \ldots,$
$(x_{n1}, x_{n2}, \ldots, x_{nk}, y_n)$. Although this approach is somewhat useful in visualizing
multiple linear regression, the geometric approach to least-squares estimation in
multiple linear regression does not involve this high-dimensional generalization.

The geometric approach to be discussed below is appealing because of its math- 
ematical elegance. For example, the estimator is derived without the use of matrix cal- 
culus. Also, the geometric approach provides deeper insight into statistical inference. 
Several advanced statistical methods including kernel smoothing (Eubank and 
Eubank 1999), Fourier analysis (Bloomfield 2000), and wavelet analysis (Ogden 
1997) can be understood as generalizations of this geometric approach. The geo- 
metric approach to linear models was first proposed by Fisher (Mahalanobis 1964). 
Christensen (1996) and Jammalamadaka and Sengupta (2003) discuss the linear stat- 
istical model almost completely from the geometric perspective. 


7.4.1 Parameter Space, Data Space, and Prediction Space 


The geometric approach to least squares begins with two high-dimensional spaces, a
$(k+1)$-dimensional space and an $n$-dimensional space. The unknown parameter
vector $\boldsymbol{\beta}$ can be viewed as a single point in $(k+1)$-dimensional space, with axes
corresponding to the $k+1$ regression coefficients $\beta_0, \beta_1, \beta_2, \ldots, \beta_k$. Hence we call this
space the parameter space (Fig. 7.3). Similarly, the data vector $\mathbf{y}$ can be viewed as a
single point in $n$-dimensional space with axes corresponding to the $n$ observations.
We call this space the data space.

Figure 7.3 Parameter space, data space, and prediction space with representative elements.

The $\mathbf{X}$ matrix of the multiple regression model (7.4) can be written as a partitioned
matrix in terms of its $k+1$ columns as

$$\mathbf{X} = (\mathbf{j}, \mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3, \ldots, \mathbf{x}_k).$$

The columns of $\mathbf{X}$, including $\mathbf{j}$, are all $n$-dimensional vectors and are therefore
points in the data space. Note that because we assumed that $\mathbf{X}$ is of rank $k+1$,
these vectors are linearly independent. The set of all possible linear combinations
of the columns of $\mathbf{X}$ (Section 2.3) constitutes a subset of the data space. Elements
of this subset can be written as

$$\mathbf{X}\mathbf{b} = b_0\mathbf{j} + b_1\mathbf{x}_1 + b_2\mathbf{x}_2 + \cdots + b_k\mathbf{x}_k, \tag{7.28}$$

where $\mathbf{b}$ is any $(k+1)$-vector, that is, any vector in the parameter space. This subset
actually has the status of a subspace because it is closed under addition and scalar
multiplication (Harville 1997, pp. 28-29). This subset is said to be the subspace
generated or spanned by the columns of $\mathbf{X}$, and we will call this subspace the prediction
space. The columns of $\mathbf{X}$ constitute a basis set for the prediction space.


7.4.2 Geometric Interpretation of the Multiple Linear Regression Model 


The multiple linear regression model (7.4) states that $\mathbf{y}$ is equal to a vector in the
prediction space, $E(\mathbf{y}) = \mathbf{X}\boldsymbol{\beta}$, plus a vector of random errors, $\boldsymbol{\varepsilon}$ (Fig. 7.4). The
problem is that neither $\boldsymbol{\beta}$ nor $\boldsymbol{\varepsilon}$ is known. However, the data vector $\mathbf{y}$, which is not in
the prediction space, is known. And it is known that $E(\mathbf{y})$ is in the prediction space.

Figure 7.4 Geometric relationships of vectors associated with the multiple linear regression
model.

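Multiple linear regression can be understood geometrically as the process of
finding a sensible estimate of $E(\mathbf{y})$ in the prediction space and then determining
the vector in the parameter space that is associated with this estimate (Fig. 7.4).
The estimate of $E(\mathbf{y})$ is denoted as $\hat{\mathbf{y}}$, and the associated vector in the parameter
space is denoted as $\hat{\boldsymbol{\beta}}$.

A reasonable geometric idea is to estimate $E(\mathbf{y})$ using the point in the prediction
space that is closest to $\mathbf{y}$. It turns out that $\hat{\mathbf{y}}$, the closest point in the prediction
space to $\mathbf{y}$, can be found by noting that the difference vector $\hat{\boldsymbol{\varepsilon}} = \mathbf{y}-\hat{\mathbf{y}}$ must be
orthogonal (perpendicular) to the prediction space (Harville 1997, p. 170).
Furthermore, because the prediction space is spanned by the columns of $\mathbf{X}$, the
point $\hat{\mathbf{y}}$ must be such that $\hat{\boldsymbol{\varepsilon}}$ is orthogonal to the columns of $\mathbf{X}$. Using an extension
of (2.80), we therefore seek $\hat{\mathbf{y}}$ such that

$$\mathbf{X}'\hat{\boldsymbol{\varepsilon}} = \mathbf{0}$$

or

$$\mathbf{X}'(\mathbf{y}-\hat{\mathbf{y}}) = \mathbf{X}'(\mathbf{y}-\mathbf{X}\hat{\boldsymbol{\beta}}) = \mathbf{X}'\mathbf{y}-\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{0}, \tag{7.29}$$

which implies that

$$\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}'\mathbf{y}.$$

Thus, using purely geometric ideas, we obtain the normal equations (7.8) and
consequently the usual least-squares estimator $\hat{\boldsymbol{\beta}}$ in (7.6). We can then calculate $\hat{\mathbf{y}}$ as
$\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} = \mathbf{H}\mathbf{y}$. Also, $\hat{\boldsymbol{\varepsilon}} = \mathbf{y}-\mathbf{X}\hat{\boldsymbol{\beta}} = (\mathbf{I}-\mathbf{H})\mathbf{y}$ can be taken as an
estimate of $\boldsymbol{\varepsilon}$. Since $\hat{\boldsymbol{\varepsilon}}$ is a vector in $(n-k-1)$-dimensional space, it seems reasonable
to estimate $\sigma^2$ as the squared length (2.22) of $\hat{\boldsymbol{\varepsilon}}$ divided by $n-k-1$. In other words,
a sensible estimator of $\sigma^2$ is $s^2 = \mathbf{y}'(\mathbf{I}-\mathbf{H})\mathbf{y}/(n-k-1)$, which is equal to (7.24) [see (7.26)].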
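The geometric statements of this section can be illustrated numerically. The sketch below assumes a randomly generated design matrix and data vector (both hypothetical, chosen only for illustration). It forms the projection matrix $\mathbf{H}$, checks that the residual vector is orthogonal to the columns of $\mathbf{X}$ as in (7.29), and compares the distance from $\mathbf{y}$ to $\hat{\mathbf{y}}$ with the distance from $\mathbf{y}$ to another point of the prediction space.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 10, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])   # hypothetical design
y = rng.normal(size=n)                                       # hypothetical data vector

H = X @ np.linalg.solve(X.T @ X, X.T)   # projection onto the prediction space
y_hat = H @ y                           # closest point in the prediction space
e_hat = y - y_hat                       # residual vector (I - H) y

orth = X.T @ e_hat                      # X' e_hat, which (7.29) sets to zero

# Another (arbitrary) point Xb in the prediction space, for comparison.
b_other = rng.normal(size=k + 1)
dist_hat = np.linalg.norm(y - y_hat)
dist_other = np.linalg.norm(y - X @ b_other)

s2 = (e_hat @ e_hat) / (n - k - 1)      # squared length of e_hat over n - k - 1
```

Because $\mathbf{H}$ is symmetric and idempotent with trace $k+1$, the residual vector lies in a subspace of dimension $n-k-1$, which motivates the divisor used for `s2`.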


7.5 THE MODEL IN CENTERED FORM 
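The model in (7.3) for each $y_i$ can be written in terms of centered $x$ variables as

\begin{align}
y_i &= \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \varepsilon_i \notag\\
&= \alpha + \beta_1(x_{i1}-\bar{x}_1) + \beta_2(x_{i2}-\bar{x}_2) + \cdots + \beta_k(x_{ik}-\bar{x}_k) + \varepsilon_i, \tag{7.30}
\end{align}

$i = 1, 2, \ldots, n$, where

$$\alpha = \beta_0 + \beta_1\bar{x}_1 + \beta_2\bar{x}_2 + \cdots + \beta_k\bar{x}_k \tag{7.31}$$

and $\bar{x}_j = \sum_{i=1}^{n} x_{ij}/n$, $j = 1, 2, \ldots, k$. The centered form of the model is useful in
expressing certain hypothesis tests (Section 8.1), in a search for influential
observations (Section 9.2), and in providing other insights.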




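In matrix form, the centered model (7.30) for $y_1, y_2, \ldots, y_n$ becomes

$$\mathbf{y} = (\mathbf{j}, \mathbf{X}_c)\begin{pmatrix}\alpha\\ \boldsymbol{\beta}_1\end{pmatrix} + \boldsymbol{\varepsilon}, \tag{7.32}$$

where $\boldsymbol{\beta}_1 = (\beta_1, \beta_2, \ldots, \beta_k)'$,

$$\mathbf{X}_c = \left(\mathbf{I}-\frac{1}{n}\mathbf{J}\right)\mathbf{X}_1 = \begin{pmatrix}
x_{11}-\bar{x}_1 & x_{12}-\bar{x}_2 & \cdots & x_{1k}-\bar{x}_k\\
x_{21}-\bar{x}_1 & x_{22}-\bar{x}_2 & \cdots & x_{2k}-\bar{x}_k\\
\vdots & \vdots & & \vdots\\
x_{n1}-\bar{x}_1 & x_{n2}-\bar{x}_2 & \cdots & x_{nk}-\bar{x}_k
\end{pmatrix}, \tag{7.33}$$

and $\mathbf{X}_1$ is as given in (7.19). The matrix $\mathbf{I}-(1/n)\mathbf{J}$ is sometimes called the centering
matrix.

As in (7.8), the normal equations for the model in (7.32) are

$$(\mathbf{j}, \mathbf{X}_c)'(\mathbf{j}, \mathbf{X}_c)\begin{pmatrix}\hat{\alpha}\\ \hat{\boldsymbol{\beta}}_1\end{pmatrix} = (\mathbf{j}, \mathbf{X}_c)'\mathbf{y}. \tag{7.34}$$

By (2.35) and (2.39), the product $(\mathbf{j}, \mathbf{X}_c)'(\mathbf{j}, \mathbf{X}_c)$ on the left side of (7.34) becomes

\begin{align}
(\mathbf{j}, \mathbf{X}_c)'(\mathbf{j}, \mathbf{X}_c) &= \begin{pmatrix}\mathbf{j}'\\ \mathbf{X}_c'\end{pmatrix}(\mathbf{j}, \mathbf{X}_c) = \begin{pmatrix}\mathbf{j}'\mathbf{j} & \mathbf{j}'\mathbf{X}_c\\ \mathbf{X}_c'\mathbf{j} & \mathbf{X}_c'\mathbf{X}_c\end{pmatrix} \notag\\
&= \begin{pmatrix} n & \mathbf{0}'\\ \mathbf{0} & \mathbf{X}_c'\mathbf{X}_c\end{pmatrix}, \tag{7.35}
\end{align}

where $\mathbf{j}'\mathbf{X}_c = \mathbf{0}'$ because the columns of $\mathbf{X}_c$ sum to zero (Problem 7.16). The right
side of (7.34) can be written as

$$(\mathbf{j}, \mathbf{X}_c)'\mathbf{y} = \begin{pmatrix}\mathbf{j}'\mathbf{y}\\ \mathbf{X}_c'\mathbf{y}\end{pmatrix} = \begin{pmatrix} n\bar{y}\\ \mathbf{X}_c'\mathbf{y}\end{pmatrix}.$$

The least-squares estimators are then given by

\begin{align*}
\begin{pmatrix}\hat{\alpha}\\ \hat{\boldsymbol{\beta}}_1\end{pmatrix} &= [(\mathbf{j}, \mathbf{X}_c)'(\mathbf{j}, \mathbf{X}_c)]^{-1}(\mathbf{j}, \mathbf{X}_c)'\mathbf{y} = \begin{pmatrix} n & \mathbf{0}'\\ \mathbf{0} & \mathbf{X}_c'\mathbf{X}_c\end{pmatrix}^{-1}\begin{pmatrix} n\bar{y}\\ \mathbf{X}_c'\mathbf{y}\end{pmatrix}\\
&= \begin{pmatrix} 1/n & \mathbf{0}'\\ \mathbf{0} & (\mathbf{X}_c'\mathbf{X}_c)^{-1}\end{pmatrix}\begin{pmatrix} n\bar{y}\\ \mathbf{X}_c'\mathbf{y}\end{pmatrix} = \begin{pmatrix}\bar{y}\\ (\mathbf{X}_c'\mathbf{X}_c)^{-1}\mathbf{X}_c'\mathbf{y}\end{pmatrix},
\end{align*}

or

$$\hat{\alpha} = \bar{y}, \tag{7.36}$$

$$\hat{\boldsymbol{\beta}}_1 = (\mathbf{X}_c'\mathbf{X}_c)^{-1}\mathbf{X}_c'\mathbf{y}. \tag{7.37}$$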




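These estimators are the same as the usual least-squares estimators $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$
in (7.6), with the adjustment

$$\hat{\beta}_0 = \hat{\alpha} - \hat{\beta}_1\bar{x}_1 - \hat{\beta}_2\bar{x}_2 - \cdots - \hat{\beta}_k\bar{x}_k = \bar{y} - \hat{\boldsymbol{\beta}}_1'\bar{\mathbf{x}} \tag{7.38}$$

obtained from an estimator of $\alpha$ in (7.31) (see Problem 7.17).
When we express $\hat{y}$ in centered form

$$\hat{y} = \hat{\alpha} + \hat{\beta}_1(x_1-\bar{x}_1) + \cdots + \hat{\beta}_k(x_k-\bar{x}_k),$$

it is clear that the fitted regression plane passes through the point $(\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_k, \bar{y})$.
Adapting the expression for SSE (7.24) to the centered model with centered $y$'s,
we obtain

$$\text{SSE} = \sum_{i=1}^{n}(y_i-\bar{y})^2 - \hat{\boldsymbol{\beta}}_1'\mathbf{X}_c'\mathbf{y}, \tag{7.39}$$

which turns out to be equal to $\text{SSE} = \mathbf{y}'\mathbf{y}-\hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y}$ (see Problem 7.19).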
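The equivalence between the centered and uncentered fits can be checked directly. The following sketch assumes randomly generated $x$ columns and responses (hypothetical data, used only for illustration) and compares (7.36)-(7.39) with the usual estimators from (7.6) and (7.24).

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 15, 3
X1 = rng.normal(size=(n, k))                     # hypothetical x columns (no intercept)
y = 1.0 + X1 @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)

# Uncentered fit: beta_hat = (X'X)^{-1} X'y with X = (j, X1)
X = np.column_stack([np.ones(n), X1])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Centered fit: Xc = [I - (1/n)J] X1, as in (7.33)
Xc = X1 - X1.mean(axis=0)
alpha_hat = y.mean()                                     # (7.36)
beta1_hat = np.linalg.solve(Xc.T @ Xc, Xc.T @ y)         # (7.37)
beta0_hat = alpha_hat - beta1_hat @ X1.mean(axis=0)      # (7.38)

sse_centered = np.sum((y - y.mean()) ** 2) - beta1_hat @ (Xc.T @ y)   # (7.39)
sse_full = y @ y - beta_hat @ (X.T @ y)                                # (7.24)
```

The slope estimates agree exactly (up to rounding), the intercept is recovered through (7.38), and the two expressions for SSE coincide.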

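We can use (7.36)-(7.38) to express $\hat{\boldsymbol{\beta}}_1$ and $\hat{\beta}_0$ in terms of sample variances and
covariances, which will be useful in comparing these estimators with those for the
random-$x$ case in Chapter 10. We first define a sample covariance matrix for the $x$
variables and a vector of sample covariances between $y$ and the $x$'s

$$\mathbf{S}_{xx} = \begin{pmatrix}
s_1^2 & s_{12} & \cdots & s_{1k}\\
s_{21} & s_2^2 & \cdots & s_{2k}\\
\vdots & \vdots & & \vdots\\
s_{k1} & s_{k2} & \cdots & s_k^2
\end{pmatrix}, \qquad
\mathbf{s}_{yx} = \begin{pmatrix} s_{y1}\\ s_{y2}\\ \vdots\\ s_{yk}\end{pmatrix}, \tag{7.40}$$

where $s_i^2$, $s_{ij}$, and $s_{yi}$ are analogous to $s^2$ and $s_{xy}$ defined in (5.6) and (5.15); for
example

$$s_2^2 = \frac{\sum_{i=1}^{n}(x_{i2}-\bar{x}_2)^2}{n-1}, \tag{7.41}$$

$$s_{12} = \frac{\sum_{i=1}^{n}(x_{i1}-\bar{x}_1)(x_{i2}-\bar{x}_2)}{n-1}, \tag{7.42}$$

$$s_{y2} = \frac{\sum_{i=1}^{n}(x_{i2}-\bar{x}_2)(y_i-\bar{y})}{n-1}, \tag{7.43}$$

with $\bar{x}_2 = \sum_{i=1}^{n} x_{i2}/n$. However, since the $x$'s are fixed, these sample variances and
covariances do not estimate population variances and covariances. If the $x$'s were
random variables, as in Chapter 10, the $s_i^2$, $s_{ij}$, and $s_{yi}$ values would estimate
population parameters.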




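To express $\hat{\boldsymbol{\beta}}_1$ and $\hat{\beta}_0$ in terms of $\mathbf{S}_{xx}$ and $\mathbf{s}_{yx}$, we first write $\mathbf{S}_{xx}$ and $\mathbf{s}_{yx}$ in terms of
the centered matrix $\mathbf{X}_c$:

$$\mathbf{S}_{xx} = \frac{\mathbf{X}_c'\mathbf{X}_c}{n-1}, \tag{7.44}$$

$$\mathbf{s}_{yx} = \frac{\mathbf{X}_c'\mathbf{y}}{n-1}. \tag{7.45}$$

Note that $\mathbf{X}_c'\mathbf{y}$ in (7.45) contains terms of the form $\sum_{i=1}^{n}(x_{ij}-\bar{x}_j)y_i$ rather than
$\sum_{i=1}^{n}(x_{ij}-\bar{x}_j)(y_i-\bar{y})$ as in (7.43). It can readily be shown that
$\sum_i (x_{ij}-\bar{x}_j)(y_i-\bar{y}) = \sum_i (x_{ij}-\bar{x}_j)y_i$ (see Problem 6.2).

From (7.37), (7.44), and (7.45), we have

$$\hat{\boldsymbol{\beta}}_1 = (n-1)(\mathbf{X}_c'\mathbf{X}_c)^{-1}\frac{\mathbf{X}_c'\mathbf{y}}{n-1} = \left(\frac{\mathbf{X}_c'\mathbf{X}_c}{n-1}\right)^{-1}\frac{\mathbf{X}_c'\mathbf{y}}{n-1} = \mathbf{S}_{xx}^{-1}\mathbf{s}_{yx}, \tag{7.46}$$

and from (7.38) and (7.46), we obtain

$$\hat{\beta}_0 = \hat{\alpha} - \hat{\boldsymbol{\beta}}_1'\bar{\mathbf{x}} = \bar{y} - \mathbf{s}_{yx}'\mathbf{S}_{xx}^{-1}\bar{\mathbf{x}}. \tag{7.47}$$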


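Example 7.5. For the data in Table 7.1, we calculate $\hat{\boldsymbol{\beta}}_1$ and $\hat{\beta}_0$ using (7.46) and (7.47).

$$\hat{\boldsymbol{\beta}}_1 = \mathbf{S}_{xx}^{-1}\mathbf{s}_{yx} = \begin{pmatrix} 6.4242 & 8.5455\\ 8.5455 & 12.4545\end{pmatrix}^{-1}\begin{pmatrix} 8.3636\\ 9.7273\end{pmatrix} = \begin{pmatrix} 3.0118\\ -1.2855\end{pmatrix},$$

\begin{align*}
\hat{\beta}_0 &= \bar{y} - \mathbf{s}_{yx}'\mathbf{S}_{xx}^{-1}\bar{\mathbf{x}}\\
&= 7.5000 - (3.0118,\; -1.2855)\begin{pmatrix} 4.3333\\ 8.5000\end{pmatrix}\\
&= 7.5000 - 2.1246 = 5.3754.
\end{align*}

These values are the same as those obtained in Example 7.3.1a.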
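The computation in Example 7.5 can be sketched with `numpy.cov`, which uses the divisor $n-1$ by default. The data values below are an assumption: Table 7.1 is not reproduced in this section, and the values were chosen because they reproduce the sample covariances and means quoted in the example.

```python
import numpy as np

# Values assumed to match Table 7.1 (they reproduce the quoted S_xx, s_yx, and means).
y = np.array([2, 3, 2, 7, 6, 8, 10, 7, 8, 12, 11, 14], dtype=float)
x1 = np.array([0, 2, 2, 2, 4, 4, 4, 6, 6, 6, 8, 8], dtype=float)
x2 = np.array([2, 6, 7, 5, 9, 8, 7, 10, 11, 9, 15, 13], dtype=float)
X1 = np.column_stack([x1, x2])

# Joint sample covariance matrix of (x1, x2, y); columns are variables.
S = np.cov(np.column_stack([X1, y]), rowvar=False)
S_xx = S[:2, :2]                 # sample covariance matrix of the x's, (7.44)
s_yx = S[:2, 2]                  # sample covariances between y and the x's, (7.45)

beta1_hat = np.linalg.solve(S_xx, s_yx)                # (7.46)
beta0_hat = y.mean() - beta1_hat @ X1.mean(axis=0)     # (7.47)
```

Solving the linear system with `np.linalg.solve` avoids forming $\mathbf{S}_{xx}^{-1}$ explicitly; the result is the same as in (7.46).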


7.6 NORMAL MODEL 


7.6.1 Assumptions 


Thus far we have made no normality assumptions about the random variables
$y_1, y_2, \ldots, y_n$. To the assumptions in Section 7.2, we now add that

$$\mathbf{y} \text{ is } N_n(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{I}) \quad\text{or}\quad \boldsymbol{\varepsilon} \text{ is } N_n(\mathbf{0}, \sigma^2\mathbf{I}).$$

Under normality, $\sigma_{ij} = 0$ implies that the $y$ (or $\varepsilon$) variables are independent, as well as
uncorrelated.


7.6.2 Maximum Likelihood Estimators for $\boldsymbol{\beta}$ and $\sigma^2$


With the normality assumption, we can obtain maximum likelihood estimators. The
likelihood function is the joint density of the $y$'s, which we denote by $L(\boldsymbol{\beta}, \sigma^2)$. We
seek values of the unknown $\boldsymbol{\beta}$ and $\sigma^2$ that maximize $L(\boldsymbol{\beta}, \sigma^2)$ for the given $y$ and $x$
values in the sample.

In the case of the normal density function, it is possible to find maximum
likelihood estimators $\hat{\boldsymbol{\beta}}$ and $\hat{\sigma}^2$ by differentiation. Because the normal density involves a
product and an exponential, it is simpler to work with $\ln L(\boldsymbol{\beta}, \sigma^2)$, which achieves its
maximum for the same values of $\boldsymbol{\beta}$ and $\sigma^2$ as does $L(\boldsymbol{\beta}, \sigma^2)$.

The maximum likelihood estimators for $\boldsymbol{\beta}$ and $\sigma^2$ are given in the following
theorem.


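Theorem 7.6a. If $\mathbf{y}$ is $N_n(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{I})$, where $\mathbf{X}$ is $n\times(k+1)$ of rank $k+1 < n$, the
maximum likelihood estimators of $\boldsymbol{\beta}$ and $\sigma^2$ are

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}, \tag{7.48}$$

$$\hat{\sigma}^2 = \frac{1}{n}(\mathbf{y}-\mathbf{X}\hat{\boldsymbol{\beta}})'(\mathbf{y}-\mathbf{X}\hat{\boldsymbol{\beta}}). \tag{7.49}$$

PROOF. We sketch the proof. For the remaining steps, see Problem 7.21. The
likelihood function (joint density of $y_1, y_2, \ldots, y_n$) is given by the multivariate normal
density (4.9)

\begin{align}
L(\boldsymbol{\beta}, \sigma^2) = f(\mathbf{y}; \boldsymbol{\beta}, \sigma^2) &= \frac{1}{(2\pi)^{n/2}|\sigma^2\mathbf{I}|^{1/2}}\, e^{-(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})'(\sigma^2\mathbf{I})^{-1}(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})/2} \notag\\
&= \frac{1}{(2\pi\sigma^2)^{n/2}}\, e^{-(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})'(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})/2\sigma^2}. \tag{7.50}
\end{align}

[Since the $y_i$'s are independent, $L(\boldsymbol{\beta}, \sigma^2)$ can also be obtained as $\prod_{i=1}^{n} f(y_i; \mathbf{x}_i'\boldsymbol{\beta}, \sigma^2)$.]
Then $\ln L(\boldsymbol{\beta}, \sigma^2)$ becomes

$$\ln L(\boldsymbol{\beta}, \sigma^2) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})'(\mathbf{y}-\mathbf{X}\boldsymbol{\beta}). \tag{7.51}$$

Taking the partial derivatives of $\ln L(\boldsymbol{\beta}, \sigma^2)$ with respect to $\boldsymbol{\beta}$ and $\sigma^2$ and setting the
results equal to zero will produce (7.48) and (7.49). To verify that $\hat{\boldsymbol{\beta}}$ maximizes (7.50)
or (7.51), see (7.10).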


The maximum likelihood estimator $\hat{\boldsymbol{\beta}}$ in (7.48) is the same as the least-squares
estimator $\hat{\boldsymbol{\beta}}$ in Theorem 7.3a. The estimator $\hat{\sigma}^2$ in (7.49) is biased since the denominator is $n$
rather than $n-k-1$. We often use the unbiased estimator $s^2$ given in (7.23) or (7.24).
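A quick numerical check of Theorem 7.6a is to evaluate the log-likelihood (7.51) at the estimators in (7.48) and (7.49) and at a few nearby parameter values. The sketch below assumes simulated data (hypothetical design, coefficients, and error scale); the helper name `log_lik` is an illustrative choice.

```python
import numpy as np

def log_lik(beta, sigma2, X, y):
    """ln L(beta, sigma^2) from (7.51)."""
    n = len(y)
    r = y - X @ beta
    return -0.5 * n * np.log(2 * np.pi) - 0.5 * n * np.log(sigma2) - (r @ r) / (2 * sigma2)

rng = np.random.default_rng(3)
n, k = 20, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])     # hypothetical design
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=1.5, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)    # (7.48)
r = y - X @ beta_hat
sigma2_mle = (r @ r) / n                        # (7.49), divisor n
s2 = (r @ r) / (n - k - 1)                      # (7.23), divisor n - k - 1

ll_max = log_lik(beta_hat, sigma2_mle, X, y)
# Other parameter values should not give a larger log-likelihood.
ll_shift_beta = log_lik(beta_hat + 0.1, sigma2_mle, X, y)
ll_shift_sigma = log_lik(beta_hat, 1.2 * sigma2_mle, X, y)
ll_at_s2 = log_lik(beta_hat, s2, X, y)
```

The last comparison shows that the unbiased estimator $s^2$ is not the maximizer of the likelihood; it differs from $\hat{\sigma}^2$ by the factor $n/(n-k-1)$.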


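7.6.3 Properties of $\hat{\boldsymbol{\beta}}$ and $\hat{\sigma}^2$


We now consider some properties of $\hat{\boldsymbol{\beta}}$ and $\hat{\sigma}^2$ (or $s^2$) under the normal model. The
distributions of $\hat{\boldsymbol{\beta}}$ and $\hat{\sigma}^2$ are given in the following theorem.


Theorem 7.6b. Suppose that $\mathbf{y}$ is $N_n(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{I})$, where $\mathbf{X}$ is $n\times(k+1)$ of rank
$k+1 < n$ and $\boldsymbol{\beta} = (\beta_0, \beta_1, \ldots, \beta_k)'$. Then the maximum likelihood estimators $\hat{\boldsymbol{\beta}}$ and $\hat{\sigma}^2$
given in Theorem 7.6a have the following distributional properties:

(i) $\hat{\boldsymbol{\beta}}$ is $N_{k+1}[\boldsymbol{\beta}, \sigma^2(\mathbf{X}'\mathbf{X})^{-1}]$.
(ii) $n\hat{\sigma}^2/\sigma^2$ is $\chi^2(n-k-1)$, or equivalently, $(n-k-1)s^2/\sigma^2$ is $\chi^2(n-k-1)$.
(iii) $\hat{\boldsymbol{\beta}}$ and $\hat{\sigma}^2$ (or $s^2$) are independent.

PROOF

(i) Since $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$ is a linear function of $\mathbf{y}$ of the form $\hat{\boldsymbol{\beta}} = \mathbf{A}\mathbf{y}$, where
$\mathbf{A} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$ is a constant matrix, then by Theorem 4.4a(ii), $\hat{\boldsymbol{\beta}}$ is
$N_{k+1}[\boldsymbol{\beta}, \sigma^2(\mathbf{X}'\mathbf{X})^{-1}]$.

(ii) The result follows from Corollary 2 to Theorem 5.5.

(iii) The result follows from Corollary 1 to Theorem 5.6a.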

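Another property of $\hat{\boldsymbol{\beta}}$ and $\hat{\sigma}^2$ under normality is that they are sufficient statistics.
Intuitively, a statistic is sufficient for a parameter if the statistic summarizes all the
information in the sample about the parameter. Sufficiency of $\hat{\boldsymbol{\beta}}$ and $\hat{\sigma}^2$ can be
established by the Neyman factorization theorem [see Hogg and Craig (1995, p. 318) or
Graybill (1976, pp. 69-70)], which states that $\hat{\boldsymbol{\beta}}$ and $\hat{\sigma}^2$ are jointly sufficient for $\boldsymbol{\beta}$
and $\sigma^2$ if the density $f(\mathbf{y}; \boldsymbol{\beta}, \sigma^2)$ can be factored as $f(\mathbf{y}; \boldsymbol{\beta}, \sigma^2) =
g(\hat{\boldsymbol{\beta}}, \hat{\sigma}^2, \boldsymbol{\beta}, \sigma^2)h(\mathbf{y})$, where $h(\mathbf{y})$ does not depend on $\boldsymbol{\beta}$ or $\sigma^2$. The following
theorem shows that $\hat{\boldsymbol{\beta}}$ and $\hat{\sigma}^2$ satisfy this criterion.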


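Theorem 7.6c. If $\mathbf{y}$ is $N_n(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{I})$, then $\hat{\boldsymbol{\beta}}$ and $\hat{\sigma}^2$ are jointly sufficient for $\boldsymbol{\beta}$ and $\sigma^2$.


PROOF. The density $f(\mathbf{y}; \boldsymbol{\beta}, \sigma^2)$ is given in (7.50). In the exponent, we add and
subtract $\mathbf{X}\hat{\boldsymbol{\beta}}$ to obtain

\begin{align*}
(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})'(\mathbf{y}-\mathbf{X}\boldsymbol{\beta}) &= (\mathbf{y}-\mathbf{X}\hat{\boldsymbol{\beta}}+\mathbf{X}\hat{\boldsymbol{\beta}}-\mathbf{X}\boldsymbol{\beta})'(\mathbf{y}-\mathbf{X}\hat{\boldsymbol{\beta}}+\mathbf{X}\hat{\boldsymbol{\beta}}-\mathbf{X}\boldsymbol{\beta})\\
&= [(\mathbf{y}-\mathbf{X}\hat{\boldsymbol{\beta}})+\mathbf{X}(\hat{\boldsymbol{\beta}}-\boldsymbol{\beta})]'[(\mathbf{y}-\mathbf{X}\hat{\boldsymbol{\beta}})+\mathbf{X}(\hat{\boldsymbol{\beta}}-\boldsymbol{\beta})].
\end{align*}

Expanding this in terms of $\mathbf{y}-\mathbf{X}\hat{\boldsymbol{\beta}}$ and $\mathbf{X}(\hat{\boldsymbol{\beta}}-\boldsymbol{\beta})$, we obtain four terms, two of
which vanish because of the normal equations $\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}'\mathbf{y}$. The result is

\begin{align}
(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})'(\mathbf{y}-\mathbf{X}\boldsymbol{\beta}) &= (\mathbf{y}-\mathbf{X}\hat{\boldsymbol{\beta}})'(\mathbf{y}-\mathbf{X}\hat{\boldsymbol{\beta}}) + (\hat{\boldsymbol{\beta}}-\boldsymbol{\beta})'\mathbf{X}'\mathbf{X}(\hat{\boldsymbol{\beta}}-\boldsymbol{\beta}) \tag{7.52}\\
&= n\hat{\sigma}^2 + (\hat{\boldsymbol{\beta}}-\boldsymbol{\beta})'\mathbf{X}'\mathbf{X}(\hat{\boldsymbol{\beta}}-\boldsymbol{\beta}). \notag
\end{align}

We can now write the density (7.50) as

$$f(\mathbf{y}; \boldsymbol{\beta}, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}}\, e^{-[n\hat{\sigma}^2 + (\hat{\boldsymbol{\beta}}-\boldsymbol{\beta})'\mathbf{X}'\mathbf{X}(\hat{\boldsymbol{\beta}}-\boldsymbol{\beta})]/2\sigma^2},$$

which is of the form

$$f(\mathbf{y}; \boldsymbol{\beta}, \sigma^2) = g(\hat{\boldsymbol{\beta}}, \hat{\sigma}^2, \boldsymbol{\beta}, \sigma^2)h(\mathbf{y}),$$

where $h(\mathbf{y}) = 1$. Therefore, by the Neyman factorization theorem, $\hat{\boldsymbol{\beta}}$ and $\hat{\sigma}^2$ are
jointly sufficient for $\boldsymbol{\beta}$ and $\sigma^2$.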
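The decomposition (7.52) holds for any value of $\boldsymbol{\beta}$, which is what allows the density to depend on the data only through $\hat{\boldsymbol{\beta}}$ and $\hat{\sigma}^2$. A minimal numerical check, assuming a randomly generated design, data vector, and parameter value (all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 9, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])   # hypothetical design
y = rng.normal(size=n)                                       # hypothetical data
beta = rng.normal(size=k + 1)                                # an arbitrary parameter value

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
sigma2_mle = np.sum((y - X @ beta_hat) ** 2) / n

d = beta_hat - beta
lhs = np.sum((y - X @ beta) ** 2)               # (y - X beta)'(y - X beta)
rhs = n * sigma2_mle + d @ (X.T @ X) @ d        # right side of (7.52)
cross = (y - X @ beta_hat) @ X @ d              # cross term removed by the normal equations
```

The cross term is zero (up to rounding) because $\mathbf{X}'(\mathbf{y}-\mathbf{X}\hat{\boldsymbol{\beta}}) = \mathbf{0}$, so `lhs` and `rhs` agree.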


Note that $\hat{\boldsymbol{\beta}}$ and $\hat{\sigma}^2$ are jointly sufficient for $\boldsymbol{\beta}$ and $\sigma^2$, not independently sufficient;
that is, $f(\mathbf{y}; \boldsymbol{\beta}, \sigma^2)$ does not factor into the form $g_1(\hat{\boldsymbol{\beta}}, \boldsymbol{\beta})g_2(\hat{\sigma}^2, \sigma^2)h(\mathbf{y})$. Also note
that because $s^2 = n\hat{\sigma}^2/(n-k-1)$, the proof of Theorem 7.6c can be easily modified
to show that $\hat{\boldsymbol{\beta}}$ and $s^2$ are also jointly sufficient for $\boldsymbol{\beta}$ and $\sigma^2$.

Since $\hat{\boldsymbol{\beta}}$ and $s^2$ are sufficient, no other estimators can improve on the information
they extract from the sample to estimate $\boldsymbol{\beta}$ and $\sigma^2$. Thus, it is not surprising that $\hat{\boldsymbol{\beta}}$ and
$s^2$ are minimum variance unbiased estimators (each $\hat{\beta}_j$ in $\hat{\boldsymbol{\beta}}$ has minimum variance).
This result is given in the following theorem.


Theorem 7.6d. If $\mathbf{y}$ is $N_n(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{I})$, then $\hat{\boldsymbol{\beta}}$ and $s^2$ have minimum variance among all
unbiased estimators.


PROOF. See Graybill (1976, p. 176) or Christensen (1996, pp. 25-27).


In Theorem 7.3d, the elements of $\hat{\boldsymbol{\beta}}$ were shown to have minimum variance among
all linear unbiased estimators. With the normality assumption added in Theorem
7.6d, the elements of $\hat{\boldsymbol{\beta}}$ have minimum variance among all unbiased estimators.
Similarly, by Theorem 7.3g, $s^2$ has minimum variance among all quadratic unbiased
estimators. With the added normality assumption in Theorem 7.6d, $s^2$ has minimum
variance among all unbiased estimators.

The following corollary to Theorem 7.6d is analogous to Corollary 1 of Theorem
7.3d.


Corollary 1. If $\mathbf{y}$ is $N_n(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{I})$, then the minimum variance unbiased estimator of
$\mathbf{a}'\boldsymbol{\beta}$ is $\mathbf{a}'\hat{\boldsymbol{\beta}}$, where $\hat{\boldsymbol{\beta}}$ is the maximum likelihood estimator given in (7.48).


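7.7 $R^2$ IN FIXED-x REGRESSION


In (7.39), we have $\text{SSE} = \sum_{i=1}^{n}(y_i-\bar{y})^2 - \hat{\boldsymbol{\beta}}_1'\mathbf{X}_c'\mathbf{y}$. Thus the corrected total sum of
squares $\text{SST} = \sum_{i}(y_i-\bar{y})^2$ can be partitioned as

$$\sum_{i=1}^{n}(y_i-\bar{y})^2 = \hat{\boldsymbol{\beta}}_1'\mathbf{X}_c'\mathbf{y} + \text{SSE}, \tag{7.53}$$

$$\text{SST} = \text{SSR} + \text{SSE},$$

where $\text{SSR} = \hat{\boldsymbol{\beta}}_1'\mathbf{X}_c'\mathbf{y}$ is the regression sum of squares. From (7.37), we obtain
$\mathbf{X}_c'\mathbf{y} = \mathbf{X}_c'\mathbf{X}_c\hat{\boldsymbol{\beta}}_1$, and multiplying this by $\hat{\boldsymbol{\beta}}_1'$ gives $\hat{\boldsymbol{\beta}}_1'\mathbf{X}_c'\mathbf{y} = \hat{\boldsymbol{\beta}}_1'\mathbf{X}_c'\mathbf{X}_c\hat{\boldsymbol{\beta}}_1$. Then
$\text{SSR} = \hat{\boldsymbol{\beta}}_1'\mathbf{X}_c'\mathbf{y}$ can be written as

$$\text{SSR} = \hat{\boldsymbol{\beta}}_1'\mathbf{X}_c'\mathbf{X}_c\hat{\boldsymbol{\beta}}_1 = (\mathbf{X}_c\hat{\boldsymbol{\beta}}_1)'(\mathbf{X}_c\hat{\boldsymbol{\beta}}_1). \tag{7.54}$$

In this form, it is clear that SSR is due to $\boldsymbol{\beta}_1 = (\beta_1, \beta_2, \ldots, \beta_k)'$.
The proportion of the total sum of squares due to regression is

$$R^2 = \frac{\hat{\boldsymbol{\beta}}_1'\mathbf{X}_c'\mathbf{X}_c\hat{\boldsymbol{\beta}}_1}{\sum_{i=1}^{n}(y_i-\bar{y})^2} = \frac{\text{SSR}}{\text{SST}}, \tag{7.55}$$

which is known as the coefficient of determination or the squared multiple
correlation. The ratio in (7.55) is a measure of model fit and provides an indication of
how well the $x$'s predict $y$.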

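The partitioning in (7.53) can be rewritten as the identity

\begin{align*}
\sum_{i=1}^{n}(y_i-\bar{y})^2 = \mathbf{y}'\mathbf{y}-n\bar{y}^2 &= (\hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y}-n\bar{y}^2) + (\mathbf{y}'\mathbf{y}-\hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y})\\
&= \text{SSR} + \text{SSE},
\end{align*}

which leads to an alternative expression for $R^2$:

$$R^2 = \frac{\hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y}-n\bar{y}^2}{\mathbf{y}'\mathbf{y}-n\bar{y}^2}. \tag{7.56}$$

The positive square root $R$ obtained from (7.55) or (7.56) is called the multiple
correlation coefficient. If the $x$ variables were random, $R$ would estimate a population
multiple correlation (see Section 10.4).
We list some properties of $R^2$ and $R$: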


1. The range of $R^2$ is $0 \le R^2 \le 1$. If all the $\hat{\beta}_j$'s were zero, except for $\hat{\beta}_0$, $R^2$
would be 0. (This event has probability 0 for continuous data.) If all the
$y$ values fell on the fitted surface, that is, if $y_i = \hat{y}_i$, $i = 1, 2, \ldots, n$, then $R^2$
would be 1.

2. $R = r_{y\hat{y}}$; that is, the multiple correlation is equal to the simple correlation [see
(6.18)] between the observed $y_i$'s and the fitted $\hat{y}_i$'s.

3. Adding a variable $x$ to the model increases (cannot decrease) the value of $R^2$.

4. If $\beta_1 = \beta_2 = \cdots = \beta_k = 0$, then

$$E(R^2) = \frac{k}{n-1}. \tag{7.57}$$

Note that the $\hat{\beta}_j$'s will not be 0 when the $\beta_j$'s are 0.

5. $R^2$ cannot be partitioned into $k$ components, each of which is uniquely
attributable to an $x_j$, unless the $x$'s are mutually orthogonal, that is,
$\sum_{i=1}^{n}(x_{ij}-\bar{x}_j)(x_{im}-\bar{x}_m) = 0$ for $j \neq m$.

6. $R^2$ is invariant to full-rank linear transformations on the $x$'s and to a scale change
on $y$ (but not invariant to a joint linear transformation including $y$ and the $x$'s).


In properties 3 and 4 we see that if $k$ is a relatively large fraction of $n$, it is possible to
have a large value of $R^2$ that is not meaningful. In this case, $x$'s that do not contribute
to predicting $y$ may appear to do so in a particular example, and the estimated
regression equation may not be a useful estimator of the population model. To
correct for this tendency, an adjusted $R^2$, denoted by $R_a^2$, was proposed by Ezekiel
(1930). To obtain $R_a^2$, we first subtract $k/(n-1)$ in (7.57) from $R^2$ in order to
correct for the bias when $\beta_1 = \beta_2 = \cdots = \beta_k = 0$. This correction, however,
would make $R_a^2$ too small when the $\beta$'s are large, so a further modification is made
so that $R_a^2 = 1$ when $R^2 = 1$. Thus $R_a^2$ is defined as

$$R_a^2 = \frac{\left(R^2 - \dfrac{k}{n-1}\right)(n-1)}{n-k-1} = \frac{(n-1)R^2 - k}{n-k-1}. \tag{7.58}$$


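Example 7.7. For the data in Table 7.1 in Example 7.2, we obtain $R^2$ by (7.56) and
$R_a^2$ by (7.58). The values of $\hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y}$ and $\mathbf{y}'\mathbf{y}$ are given in Example 7.3.3.

\begin{align*}
R^2 &= \frac{\hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y}-n\bar{y}^2}{\mathbf{y}'\mathbf{y}-n\bar{y}^2} = \frac{814.5410 - 12(7.5)^2}{840 - 12(7.5)^2}\\
&= \frac{139.5410}{165.0000} = .8457,\\
R_a^2 &= \frac{(n-1)R^2 - k}{n-k-1} = \frac{11(.8457) - 2}{9} = .8114.
\end{align*}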
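The values in Example 7.7 and property 3 of $R^2$ can be illustrated with a short script. The data values are an assumption (Table 7.1 is not reproduced in this section; the values below reproduce $\mathbf{y}'\mathbf{y} = 840$ and $\bar{y} = 7.5$), and the helper name `r_squared` is an illustrative choice.

```python
import numpy as np

# Values assumed to match Table 7.1 (they give y'y = 840 and ybar = 7.5).
y = np.array([2, 3, 2, 7, 6, 8, 10, 7, 8, 12, 11, 14], dtype=float)
x1 = np.array([0, 2, 2, 2, 4, 4, 4, 6, 6, 6, 8, 8], dtype=float)
x2 = np.array([2, 6, 7, 5, 9, 8, 7, 10, 11, 9, 15, 13], dtype=float)

def r_squared(X, y):
    """R^2 from (7.56) for a design matrix X whose first column is all ones."""
    n = len(y)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    correction = n * y.mean() ** 2
    return (beta_hat @ (X.T @ y) - correction) / (y @ y - correction)

n, k = len(y), 2
j = np.ones(n)
r2 = r_squared(np.column_stack([j, x1, x2]), y)
r2_adj = ((n - 1) * r2 - k) / (n - k - 1)              # adjusted R^2, (7.58)
r2_x1_only = r_squared(np.column_stack([j, x1]), y)    # model without x2
```

Comparing `r2_x1_only` with `r2` illustrates property 3: the model with the extra variable has an $R^2$ at least as large, whether or not that variable is useful, which is the motivation for the adjustment in (7.58).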


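Using (7.44) and (7.46), we can express $R^2$ in (7.55) in terms of sample variances
and covariances:

$$R^2 = \frac{\hat{\boldsymbol{\beta}}_1'\mathbf{X}_c'\mathbf{X}_c\hat{\boldsymbol{\beta}}_1}{\sum_{i=1}^{n}(y_i-\bar{y})^2} = \frac{\mathbf{s}_{yx}'\mathbf{S}_{xx}^{-1}(n-1)\mathbf{S}_{xx}\mathbf{S}_{xx}^{-1}\mathbf{s}_{yx}}{\sum_{i=1}^{n}(y_i-\bar{y})^2} = \frac{\mathbf{s}_{yx}'\mathbf{S}_{xx}^{-1}\mathbf{s}_{yx}}{s_y^2}. \tag{7.59}$$

This form of $R^2$ will facilitate a comparison with $R^2$ for the random-$x$ case in Section
10.4 [see (10.34)].

Figure 7.5 Multiple correlation $R$ as cosine of $\theta$, the angle between $\mathbf{y}-\bar{y}\mathbf{j}$ and $\hat{\mathbf{y}}-\bar{y}\mathbf{j}$.

Geometrically, $R$ is the cosine of the angle $\theta$ between $\mathbf{y}$ and $\hat{\mathbf{y}}$ corrected for their
means. The mean of $\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_n$ is $\bar{y}$, the same as the mean of $y_1, y_2, \ldots, y_n$ (see
Problem 7.30). Thus the centered forms of $\mathbf{y}$ and $\hat{\mathbf{y}}$ are $\mathbf{y}-\bar{y}\mathbf{j}$ and $\hat{\mathbf{y}}-\bar{y}\mathbf{j}$. The
angle between them is illustrated in Figure 7.5. (Note that $\bar{y}\mathbf{j}$ is in the prediction
space since it is a multiple of the first column of $\mathbf{X}$.)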

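To show that $\cos\theta$ is equal to the square root of $R^2$ as given by (7.56), we use
(2.81) for the cosine of the angle between two vectors:

$$\cos\theta = \frac{(\mathbf{y}-\bar{y}\mathbf{j})'(\hat{\mathbf{y}}-\bar{y}\mathbf{j})}{\sqrt{[(\mathbf{y}-\bar{y}\mathbf{j})'(\mathbf{y}-\bar{y}\mathbf{j})][(\hat{\mathbf{y}}-\bar{y}\mathbf{j})'(\hat{\mathbf{y}}-\bar{y}\mathbf{j})]}}. \tag{7.60}$$

To simplify (7.60), we use the identity $\mathbf{y}-\bar{y}\mathbf{j} = (\hat{\mathbf{y}}-\bar{y}\mathbf{j}) + (\mathbf{y}-\hat{\mathbf{y}})$, which can also
be seen geometrically in Figure 7.5. The vectors $\hat{\mathbf{y}}-\bar{y}\mathbf{j}$ and $\mathbf{y}-\hat{\mathbf{y}}$ on the right side
of this identity are orthogonal since $\hat{\mathbf{y}}-\bar{y}\mathbf{j}$ is in the prediction space. Thus the
numerator of (7.60) can be written as

\begin{align*}
(\mathbf{y}-\bar{y}\mathbf{j})'(\hat{\mathbf{y}}-\bar{y}\mathbf{j}) &= [(\hat{\mathbf{y}}-\bar{y}\mathbf{j}) + (\mathbf{y}-\hat{\mathbf{y}})]'(\hat{\mathbf{y}}-\bar{y}\mathbf{j})\\
&= (\hat{\mathbf{y}}-\bar{y}\mathbf{j})'(\hat{\mathbf{y}}-\bar{y}\mathbf{j}) + (\mathbf{y}-\hat{\mathbf{y}})'(\hat{\mathbf{y}}-\bar{y}\mathbf{j})\\
&= (\hat{\mathbf{y}}-\bar{y}\mathbf{j})'(\hat{\mathbf{y}}-\bar{y}\mathbf{j}) + 0.
\end{align*}

Then (7.60) becomes

$$\cos\theta = \frac{\sqrt{(\hat{\mathbf{y}}-\bar{y}\mathbf{j})'(\hat{\mathbf{y}}-\bar{y}\mathbf{j})}}{\sqrt{(\mathbf{y}-\bar{y}\mathbf{j})'(\mathbf{y}-\bar{y}\mathbf{j})}} = R, \tag{7.61}$$

which is easily shown to be the square root of $R^2$ as given by (7.56). This is
equivalent to property 2 following (7.56): $R = r_{y\hat{y}}$.
We can write (7.61) in the form

$$R^2 = \frac{\sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2}{\sum_{i=1}^{n}(y_i-\bar{y})^2} = \frac{\text{SSR}}{\text{SST}},$$

in which $\text{SSR} = \sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2$ is a sum of squares for the $\hat{y}_i$'s. Then the partitioning
SST = SSR + SSE below (7.53) can be written as

$$\sum_{i=1}^{n}(y_i-\bar{y})^2 = \sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2 + \sum_{i=1}^{n}(y_i-\hat{y}_i)^2,$$

which is analogous to (6.17) for simple linear regression.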
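The geometric interpretation of $R$ can be checked numerically. The sketch below assumes simulated data (a hypothetical design and coefficient vector) and compares the cosine in (7.60) with $\sqrt{\text{SSR}/\text{SST}}$ from (7.61) and with the simple correlation between $\mathbf{y}$ and $\hat{\mathbf{y}}$.

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 14, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])   # hypothetical design
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)      # hypothetical responses

y_hat = X @ np.linalg.solve(X.T @ X, X.T @ y)
yc = y - y.mean()            # y - ybar j
yhc = y_hat - y.mean()       # y_hat - ybar j (the fitted values also have mean ybar)

cos_theta = (yc @ yhc) / np.sqrt((yc @ yc) * (yhc @ yhc))    # (7.60)
sst = yc @ yc
ssr = yhc @ yhc
sse = np.sum((y - y_hat) ** 2)
R = np.sqrt(ssr / sst)                                       # (7.61)
r_y_yhat = np.corrcoef(y, y_hat)[0, 1]                       # property 2
```

All three quantities `cos_theta`, `R`, and `r_y_yhat` coincide up to rounding, and `sst` equals `ssr + sse`, the Pythagorean relation visible in Figure 7.5.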


7.8 GENERALIZED LEAST SQUARES: cov(y) = $\sigma^2\mathbf{V}$


We now consider models in which the $y$ variables are correlated or have differing
variances, so that $\operatorname{cov}(\mathbf{y}) \neq \sigma^2\mathbf{I}$. In simple linear regression, larger values of $x_i$ may lead to
larger values of $\operatorname{var}(y_i)$. In either simple or multiple regression, if $y_1, y_2, \ldots, y_n$ occur at
sequential points in time, they are typically correlated. For cases such as these, in which
the assumption $\operatorname{cov}(\mathbf{y}) = \sigma^2\mathbf{I}$ is no longer appropriate, we use the model

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon}, \qquad E(\mathbf{y}) = \mathbf{X}\boldsymbol{\beta}, \qquad \operatorname{cov}(\mathbf{y}) = \boldsymbol{\Sigma} = \sigma^2\mathbf{V}, \tag{7.62}$$

where $\mathbf{X}$ is full-rank and $\mathbf{V}$ is a known positive definite matrix. The usage $\boldsymbol{\Sigma} = \sigma^2\mathbf{V}$
permits estimation of $\sigma^2$ in some convenient contexts (see Examples 7.8.1 and 7.8.2).

The $n\times n$ matrix $\mathbf{V}$ has $n$ diagonal elements and $\binom{n}{2}$ elements above (or below) the
diagonal. If $\mathbf{V}$ were unknown, these $\binom{n}{2}+n$ distinct elements could not be
estimated from a sample of $n$ observations. In certain applications, a simpler structure
for $\mathbf{V}$ is assumed that permits estimation. Such structures are illustrated in
Examples 7.8.1 and 7.8.2.


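7.8.1 Estimation of $\boldsymbol{\beta}$ and $\sigma^2$ when cov(y) = $\sigma^2\mathbf{V}$


In the following theorem we give estimators of $\boldsymbol{\beta}$ and $\sigma^2$ for the model in (7.62).


Theorem 7.8a. Let $\mathbf{y} = \mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon}$, let $E(\mathbf{y}) = \mathbf{X}\boldsymbol{\beta}$, and let $\operatorname{cov}(\mathbf{y}) = \operatorname{cov}(\boldsymbol{\varepsilon}) = \sigma^2\mathbf{V}$,
where $\mathbf{X}$ is a full-rank matrix and $\mathbf{V}$ is a known positive definite matrix. For this
model, we obtain the following results:

(i) The best linear unbiased estimator (BLUE) of $\boldsymbol{\beta}$ is

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{-1}\mathbf{y}. \tag{7.63}$$

(ii) The covariance matrix for $\hat{\boldsymbol{\beta}}$ is

$$\operatorname{cov}(\hat{\boldsymbol{\beta}}) = \sigma^2(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}. \tag{7.64}$$

(iii) An unbiased estimator of $\sigma^2$ is

$$s^2 = \frac{(\mathbf{y}-\mathbf{X}\hat{\boldsymbol{\beta}})'\mathbf{V}^{-1}(\mathbf{y}-\mathbf{X}\hat{\boldsymbol{\beta}})}{n-k-1} = \frac{\mathbf{y}'[\mathbf{V}^{-1}-\mathbf{V}^{-1}\mathbf{X}(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{-1}]\mathbf{y}}{n-k-1}, \tag{7.65}$$

where $\hat{\boldsymbol{\beta}}$ is as given by (7.63).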


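PROOF. We prove part (i). For parts (ii) and (iii), see Problems 7.32 and 7.33.

(i) Since $\mathbf{V}$ is positive definite, there exists an $n\times n$ nonsingular matrix $\mathbf{P}$ such that
$\mathbf{V} = \mathbf{P}\mathbf{P}'$ (see Theorem 2.6c). Multiplying $\mathbf{y} = \mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon}$ by $\mathbf{P}^{-1}$, we obtain
$\mathbf{P}^{-1}\mathbf{y} = \mathbf{P}^{-1}\mathbf{X}\boldsymbol{\beta}+\mathbf{P}^{-1}\boldsymbol{\varepsilon}$, for which $E(\mathbf{P}^{-1}\boldsymbol{\varepsilon}) = \mathbf{P}^{-1}E(\boldsymbol{\varepsilon}) = \mathbf{P}^{-1}\mathbf{0} = \mathbf{0}$ and

\begin{align*}
\operatorname{cov}(\mathbf{P}^{-1}\boldsymbol{\varepsilon}) &= \mathbf{P}^{-1}\operatorname{cov}(\boldsymbol{\varepsilon})(\mathbf{P}^{-1})' \quad [\text{by (3.44)}]\\
&= \mathbf{P}^{-1}\sigma^2\mathbf{V}(\mathbf{P}^{-1})' = \sigma^2\mathbf{P}^{-1}\mathbf{P}\mathbf{P}'(\mathbf{P}')^{-1} = \sigma^2\mathbf{I}.
\end{align*}

Thus the assumptions for Theorem 7.3d are satisfied for the model $\mathbf{P}^{-1}\mathbf{y} =
\mathbf{P}^{-1}\mathbf{X}\boldsymbol{\beta}+\mathbf{P}^{-1}\boldsymbol{\varepsilon}$, and the least-squares estimator $\hat{\boldsymbol{\beta}} = [(\mathbf{P}^{-1}\mathbf{X})'(\mathbf{P}^{-1}\mathbf{X})]^{-1}
(\mathbf{P}^{-1}\mathbf{X})'\mathbf{P}^{-1}\mathbf{y}$ is BLUE. Using Theorems 2.2b and 2.5b, this can be written as

\begin{align*}
\hat{\boldsymbol{\beta}} &= [\mathbf{X}'(\mathbf{P}^{-1})'\mathbf{P}^{-1}\mathbf{X}]^{-1}\mathbf{X}'(\mathbf{P}^{-1})'\mathbf{P}^{-1}\mathbf{y}\\
&= [\mathbf{X}'(\mathbf{P}')^{-1}\mathbf{P}^{-1}\mathbf{X}]^{-1}\mathbf{X}'(\mathbf{P}')^{-1}\mathbf{P}^{-1}\mathbf{y} \quad [\text{by (2.48)}]\\
&= [\mathbf{X}'(\mathbf{P}\mathbf{P}')^{-1}\mathbf{X}]^{-1}\mathbf{X}'(\mathbf{P}\mathbf{P}')^{-1}\mathbf{y} \quad [\text{by (2.49)}]\\
&= (\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{-1}\mathbf{y}.
\end{align*}

Note that since $\mathbf{X}$ is full-rank, $\mathbf{X}'\mathbf{V}^{-1}\mathbf{X}$ is positive definite (see Theorem 2.6b). The
estimator $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{-1}\mathbf{y}$ is usually called the generalized least-squares
estimator. The same estimator is obtained under a normality assumption.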
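The transformation argument in the proof can be mirrored in code. The sketch below assumes a randomly generated positive definite $\mathbf{V}$, design matrix, and data vector (all hypothetical), and takes $\mathbf{P}$ to be the Cholesky factor of $\mathbf{V}$, which is one convenient choice satisfying $\mathbf{V} = \mathbf{P}\mathbf{P}'$ (the theorem only requires some nonsingular $\mathbf{P}$). Ordinary least squares on the transformed model reproduces (7.63).

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 8, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])   # hypothetical design
A = rng.normal(size=(n, n))
V = A @ A.T + n * np.eye(n)          # hypothetical known positive definite V
y = rng.normal(size=n)               # hypothetical data vector

V_inv = np.linalg.inv(V)
XtVi = X.T @ V_inv
beta_gls = np.linalg.solve(XtVi @ X, XtVi @ y)            # direct formula (7.63)

P = np.linalg.cholesky(V)            # lower triangular, V = P P'
Xw = np.linalg.solve(P, X)           # P^{-1} X
yw = np.linalg.solve(P, y)           # P^{-1} y
beta_whitened = np.linalg.solve(Xw.T @ Xw, Xw.T @ yw)     # OLS on the transformed model

r = y - X @ beta_gls
s2 = (r @ V_inv @ r) / (n - k - 1)                        # first form in (7.65)
M = V_inv - XtVi.T @ np.linalg.solve(XtVi @ X, XtVi)      # V^{-1} - V^{-1}X(X'V^{-1}X)^{-1}X'V^{-1}
s2_quadratic = (y @ M @ y) / (n - k - 1)                  # quadratic-form version of s^2
```

In practice, working with the transformed model is often preferred numerically to forming $\mathbf{V}^{-1}$ explicitly; the explicit inverse is used here only to match the formulas in the theorem.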


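Theorem 7.8b. If $\mathbf{y}$ is $N_n(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{V})$, where $\mathbf{X}$ is $n\times(k+1)$ of rank $k+1$ and $\mathbf{V}$ is a
known positive definite matrix, then the maximum likelihood
estimators for $\boldsymbol{\beta}$ and $\sigma^2$ are

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{-1}\mathbf{y},$$

$$\hat{\sigma}^2 = \frac{1}{n}(\mathbf{y}-\mathbf{X}\hat{\boldsymbol{\beta}})'\mathbf{V}^{-1}(\mathbf{y}-\mathbf{X}\hat{\boldsymbol{\beta}}).$$

PROOF. The likelihood function is

$$L(\boldsymbol{\beta}, \sigma^2) = \frac{1}{(2\pi)^{n/2}|\sigma^2\mathbf{V}|^{1/2}}\, e^{-(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})'(\sigma^2\mathbf{V})^{-1}(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})/2}.$$

By (2.69), $|\sigma^2\mathbf{V}| = (\sigma^2)^n|\mathbf{V}|$. Hence

$$L(\boldsymbol{\beta}, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}|\mathbf{V}|^{1/2}}\, e^{-(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})'\mathbf{V}^{-1}(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})/2\sigma^2}.$$

The results can be obtained by differentiation of $\ln L(\boldsymbol{\beta}, \sigma^2)$ with respect to $\boldsymbol{\beta}$ and
with respect to $\sigma^2$.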


We illustrate an application of generalized least squares. 


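Example 7.8.1. Consider the centered model in (7.32)

$$\mathbf{y} = (\mathbf{j}, \mathbf{X}_c)\begin{pmatrix}\alpha\\ \boldsymbol{\beta}_1\end{pmatrix} + \boldsymbol{\varepsilon},$$

with covariance pattern

$$\boldsymbol{\Sigma} = \sigma^2[(1-\rho)\mathbf{I}+\rho\mathbf{J}] = \sigma^2\begin{pmatrix}
1 & \rho & \cdots & \rho\\
\rho & 1 & \cdots & \rho\\
\vdots & \vdots & & \vdots\\
\rho & \rho & \cdots & 1
\end{pmatrix} = \sigma^2\mathbf{V}, \tag{7.67}$$

in which all variables have the same variance $\sigma^2$ and all pairs of variables have the
same correlation $\rho$. This covariance pattern was introduced in Problem 5.26 and is
assumed for certain repeated measures and intraclass correlation designs. See
(3.19) for a definition of $\rho$.

By (7.63), we have

$$\hat{\boldsymbol{\beta}} = \begin{pmatrix}\hat{\alpha}\\ \hat{\boldsymbol{\beta}}_1\end{pmatrix} = (\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{-1}\mathbf{y}.$$

For the centered model with $\mathbf{X} = (\mathbf{j}, \mathbf{X}_c)$, the matrix $\mathbf{X}'\mathbf{V}^{-1}\mathbf{X}$ becomes

$$\mathbf{X}'\mathbf{V}^{-1}\mathbf{X} = \begin{pmatrix}\mathbf{j}'\\ \mathbf{X}_c'\end{pmatrix}\mathbf{V}^{-1}(\mathbf{j}, \mathbf{X}_c) = \begin{pmatrix}\mathbf{j}'\mathbf{V}^{-1}\mathbf{j} & \mathbf{j}'\mathbf{V}^{-1}\mathbf{X}_c\\ \mathbf{X}_c'\mathbf{V}^{-1}\mathbf{j} & \mathbf{X}_c'\mathbf{V}^{-1}\mathbf{X}_c\end{pmatrix}.$$

The inverse of the $n\times n$ matrix $\mathbf{V} = (1-\rho)\mathbf{I}+\rho\mathbf{J}$ in (7.67) is given by

$$\mathbf{V}^{-1} = a(\mathbf{I}-b\rho\mathbf{J}), \tag{7.68}$$

where $a = 1/(1-\rho)$ and $b = 1/[1+(n-1)\rho]$. Using $\mathbf{V}^{-1}$ in (7.68), $\mathbf{X}'\mathbf{V}^{-1}\mathbf{X}$
becomes

$$\mathbf{X}'\mathbf{V}^{-1}\mathbf{X} = \begin{pmatrix} bn & \mathbf{0}'\\ \mathbf{0} & a\mathbf{X}_c'\mathbf{X}_c\end{pmatrix}. \tag{7.69}$$

Similarly

$$\mathbf{X}'\mathbf{V}^{-1}\mathbf{y} = \begin{pmatrix} bn\bar{y}\\ a\mathbf{X}_c'\mathbf{y}\end{pmatrix}. \tag{7.70}$$

We therefore have

$$\begin{pmatrix}\hat{\alpha}\\ \hat{\boldsymbol{\beta}}_1\end{pmatrix} = (\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{-1}\mathbf{y} = \begin{pmatrix}\bar{y}\\ (\mathbf{X}_c'\mathbf{X}_c)^{-1}\mathbf{X}_c'\mathbf{y}\end{pmatrix},$$

which is the same as (7.36) and (7.37). Thus the usual least-squares estimators are
BLUE for a covariance structure with equal variances and equal correlations.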
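Example 7.8.1 can be checked numerically. The sketch below assumes particular values of $n$, $k$, and $\rho$ and randomly generated $x$ columns and responses (all hypothetical). It verifies the closed-form inverse (7.68) and compares the generalized and ordinary least-squares estimators for the centered model.

```python
import numpy as np

rng = np.random.default_rng(8)
n, k, rho = 10, 2, 0.4
I, J = np.eye(n), np.ones((n, n))

V = (1 - rho) * I + rho * J                   # pattern in (7.67), without sigma^2
a, b = 1 / (1 - rho), 1 / (1 + (n - 1) * rho)
V_inv = a * (I - b * rho * J)                 # closed form (7.68)

X1 = rng.normal(size=(n, k))                  # hypothetical x columns
Xc = X1 - X1.mean(axis=0)                     # centered columns
X = np.column_stack([np.ones(n), Xc])         # X = (j, Xc)
y = rng.normal(size=n)                        # hypothetical responses

XtViX = X.T @ V_inv @ X                       # block diagonal, as in (7.69)
beta_gls = np.linalg.solve(XtViX, X.T @ V_inv @ y)   # (7.63)
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)         # (7.36) and (7.37)
```

The constants $a$ and $b$ cancel when $(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}$ multiplies $\mathbf{X}'\mathbf{V}^{-1}\mathbf{y}$, which is why `beta_gls` and `beta_ols` agree for any admissible $\rho$.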


7.8.2 Misspecification of the Error Structure 


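Suppose that the model is $\mathbf{y} = \mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon}$ with $\operatorname{cov}(\mathbf{y}) = \sigma^2\mathbf{V}$, as in (7.62), and we
mistakenly (or deliberately) use the ordinary least-squares estimator $\hat{\boldsymbol{\beta}}^* = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$ in
(7.6), which we denote here by $\hat{\boldsymbol{\beta}}^*$ to distinguish it from the BLUE estimator
$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{-1}\mathbf{y}$ in (7.63). Then the mean vector and covariance matrix
for $\hat{\boldsymbol{\beta}}^*$ are

$$E(\hat{\boldsymbol{\beta}}^*) = \boldsymbol{\beta}, \tag{7.71}$$

$$\operatorname{cov}(\hat{\boldsymbol{\beta}}^*) = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}. \tag{7.72}$$

Thus the ordinary least-squares estimators are unbiased, but the covariance matrix
differs from (7.64). Because of Theorem 7.8a(i), the variances of the $\hat{\beta}_j^*$'s in (7.72)
cannot be smaller than the variances in $\operatorname{cov}(\hat{\boldsymbol{\beta}}) = \sigma^2(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}$ in (7.64). This is
illustrated in the following example.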


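Example 7.8.2. Suppose that we have a simple linear regression model
$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, where $\operatorname{var}(y_i) = \sigma^2 x_i$ and $\operatorname{cov}(y_i, y_j) = 0$ for $i \neq j$. Thus

$$\operatorname{cov}(\mathbf{y}) = \sigma^2\mathbf{V} = \sigma^2\begin{pmatrix}
x_1 & 0 & \cdots & 0\\
0 & x_2 & \cdots & 0\\
\vdots & \vdots & & \vdots\\
0 & 0 & \cdots & x_n
\end{pmatrix}.$$

This is an example of weighted least squares, which typically refers to the case where
$\mathbf{V}$ is diagonal with functions of the $x$'s on the diagonal. In this case

$$\mathbf{X} = \begin{pmatrix} 1 & x_1\\ 1 & x_2\\ \vdots & \vdots\\ 1 & x_n\end{pmatrix},$$

and by (7.63), we have

\begin{align}
\hat{\boldsymbol{\beta}} = \begin{pmatrix}\hat{\beta}_0\\ \hat{\beta}_1\end{pmatrix} &= (\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{-1}\mathbf{y} \notag\\
&= \frac{1}{\left(\sum_i x_i\right)\left(\sum_i \frac{1}{x_i}\right) - n^2}
\begin{pmatrix}
\left(\sum_i x_i\right)\left(\sum_i \frac{y_i}{x_i}\right) - n\sum_i y_i\\[4pt]
\left(\sum_i \frac{1}{x_i}\right)\left(\sum_i y_i\right) - n\sum_i \frac{y_i}{x_i}
\end{pmatrix}. \tag{7.73}
\end{align}

The covariance matrix for $\hat{\boldsymbol{\beta}}$ is given by (7.64):

\begin{align}
\operatorname{cov}(\hat{\boldsymbol{\beta}}) &= \sigma^2(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1} \notag\\
&= \frac{\sigma^2}{\left(\sum_i x_i\right)\left(\sum_i \frac{1}{x_i}\right) - n^2}
\begin{pmatrix} \sum_i x_i & -n\\ -n & \sum_i \frac{1}{x_i}\end{pmatrix}. \tag{7.74}
\end{align}

If we use the ordinary least-squares estimator $\hat{\boldsymbol{\beta}}^* = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$ as given in (6.5)
and (6.6) or in (7.12) in Example 7.3.1b, then $\operatorname{cov}(\hat{\boldsymbol{\beta}}^*)$ is given by (7.72); that is,

\begin{align}
\operatorname{cov}(\hat{\boldsymbol{\beta}}^*) &= \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} \notag\\
&= \sigma^2\begin{pmatrix} n & \sum_i x_i\\ \sum_i x_i & \sum_i x_i^2\end{pmatrix}^{-1}
\begin{pmatrix} \sum_i x_i & \sum_i x_i^2\\ \sum_i x_i^2 & \sum_i x_i^3\end{pmatrix}
\begin{pmatrix} n & \sum_i x_i\\ \sum_i x_i & \sum_i x_i^2\end{pmatrix}^{-1} \notag\\
&= \sigma^2 c\begin{pmatrix}
\sum_i x_i\left[\sum_i x_i\sum_i x_i^3 - \left(\sum_i x_i^2\right)^2\right] & n\left[\left(\sum_i x_i^2\right)^2 - \sum_i x_i\sum_i x_i^3\right]\\[4pt]
n\left[\left(\sum_i x_i^2\right)^2 - \sum_i x_i\sum_i x_i^3\right] & n^2\sum_i x_i^3 - 2n\sum_i x_i\sum_i x_i^2 + \left(\sum_i x_i\right)^3
\end{pmatrix}, \tag{7.75}
\end{align}

where $c = 1/\left[n\sum_i x_i^2 - \left(\sum_i x_i\right)^2\right]^2$. The variance of the estimator $\hat{\beta}_1^*$ is given by the
lower right diagonal element of (7.75):

$$\operatorname{var}(\hat{\beta}_1^*) = \sigma^2\,\frac{n^2\sum_i x_i^3 - 2n\sum_i x_i\sum_i x_i^2 + \left(\sum_i x_i\right)^3}{\left[n\sum_i x_i^2 - \left(\sum_i x_i\right)^2\right]^2}, \tag{7.76}$$

and the variance of the estimator $\hat{\beta}_1$ is given by the corresponding element of (7.74):

$$\operatorname{var}(\hat{\beta}_1) = \sigma^2\,\frac{\sum_i (1/x_i)}{\sum_i x_i\sum_i (1/x_i) - n^2}. \tag{7.77}$$

Consider the following seven values of $x$: 1, 2, 3, 4, 5, 6, 7. Using (7.76), we obtain
$\operatorname{var}(\hat{\beta}_1^*) = .1429\sigma^2$, and from (7.77), we have $\operatorname{var}(\hat{\beta}_1) = .1099\sigma^2$. Thus for
these values of $x$, the use of ordinary least squares yields a slope estimator with a
larger variance, as expected.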
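The two variances quoted in Example 7.8.2 can be reproduced from the matrix expressions (7.64) and (7.72), with $\sigma^2$ factored out. The following sketch uses the seven $x$ values given in the example; the variable names are illustrative.

```python
import numpy as np

x = np.arange(1, 8, dtype=float)          # the seven x values 1, 2, ..., 7
n = len(x)
X = np.column_stack([np.ones(n), x])
V = np.diag(x)                            # var(y_i) = sigma^2 x_i
V_inv = np.diag(1 / x)

# Covariance matrices divided by sigma^2
cov_gls = np.linalg.inv(X.T @ V_inv @ X)              # (7.64), weighted least squares
XtX_inv = np.linalg.inv(X.T @ X)
cov_ols = XtX_inv @ X.T @ V @ X @ XtX_inv             # (7.72), ordinary least squares

var_slope_ols = cov_ols[1, 1]
var_slope_gls = cov_gls[1, 1]

# Closed forms (7.76) and (7.77), again without sigma^2
sx, sx2, sx3, sinv = x.sum(), (x ** 2).sum(), (x ** 3).sum(), (1 / x).sum()
var_ols_formula = (n ** 2 * sx3 - 2 * n * sx * sx2 + sx ** 3) / (n * sx2 - sx ** 2) ** 2
var_gls_formula = sinv / (sx * sinv - n ** 2)

# Eigenvalues of the difference; nonnegative values mean OLS is never better here.
eig_diff = np.linalg.eigvalsh(cov_ols - cov_gls)
```

The slope variances agree with the rounded values .1429 and .1099 given in the example.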


Further consequences of using a wrong model are discussed in the next section. 


7.9 MODEL MISSPECIFICATION 


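In Section 7.8.2, we discussed some consequences of misspecification of $\operatorname{cov}(\mathbf{y})$. We
now consider consequences of misspecification of $E(\mathbf{y})$. As a framework for
discussion, let the model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon}$ be partitioned as

\begin{align}
\mathbf{y} = \mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon} &= (\mathbf{X}_1, \mathbf{X}_2)\begin{pmatrix}\boldsymbol{\beta}_1\\ \boldsymbol{\beta}_2\end{pmatrix} + \boldsymbol{\varepsilon} \notag\\
&= \mathbf{X}_1\boldsymbol{\beta}_1 + \mathbf{X}_2\boldsymbol{\beta}_2 + \boldsymbol{\varepsilon}. \tag{7.78}
\end{align}

If we leave out $\mathbf{X}_2\boldsymbol{\beta}_2$ when it should be included (i.e., when $\boldsymbol{\beta}_2 \neq \mathbf{0}$), we are
underfitting. If we include $\mathbf{X}_2\boldsymbol{\beta}_2$ when it should be excluded (i.e., when $\boldsymbol{\beta}_2 = \mathbf{0}$), we are
overfitting. We discuss the effect of underfitting or overfitting on the bias and the
variance of the $\hat{\beta}_j$, $\hat{y}$, and $s^2$ values.
We first consider estimation of $\boldsymbol{\beta}_1$ when underfitting. We write the reduced model as

$$\mathbf{y} = \mathbf{X}_1\boldsymbol{\beta}_1^* + \boldsymbol{\varepsilon}^*, \tag{7.79}$$

using $\boldsymbol{\beta}_1^*$ to emphasize that these parameters (and their estimates $\hat{\boldsymbol{\beta}}_1^*$) will be different
from $\boldsymbol{\beta}_1$ (and $\hat{\boldsymbol{\beta}}_1$) in the full model (7.78) (unless the $x$'s are orthogonal; see Corollary
1 to Theorem 7.9a and Theorem 7.10). This was illustrated in Example 7.2. In the
following theorem, we discuss the bias in the estimator $\hat{\boldsymbol{\beta}}_1^*$ obtained from (7.79) and give
the covariance matrix for $\hat{\boldsymbol{\beta}}_1^*$.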


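Theorem 7.9a. If we fit the model $\mathbf{y} = \mathbf{X}_1\boldsymbol{\beta}_1^* + \boldsymbol{\varepsilon}^*$ when the correct model is
$\mathbf{y} = \mathbf{X}_1\boldsymbol{\beta}_1 + \mathbf{X}_2\boldsymbol{\beta}_2 + \boldsymbol{\varepsilon}$ with $\operatorname{cov}(\mathbf{y}) = \sigma^2\mathbf{I}$, then the mean vector and covariance
matrix for the least-squares estimator $\hat{\boldsymbol{\beta}}_1^* = (\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{y}$ are as follows:

(i) $E(\hat{\boldsymbol{\beta}}_1^*) = \boldsymbol{\beta}_1 + \mathbf{A}\boldsymbol{\beta}_2$, where $\mathbf{A} = (\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{X}_2$. (7.80)
(ii) $\operatorname{cov}(\hat{\boldsymbol{\beta}}_1^*) = \sigma^2(\mathbf{X}_1'\mathbf{X}_1)^{-1}$. (7.81)

PROOF

(i)
\begin{align*}
E(\hat{\boldsymbol{\beta}}_1^*) &= E[(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{y}] = (\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'E(\mathbf{y})\\
&= (\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'(\mathbf{X}_1\boldsymbol{\beta}_1 + \mathbf{X}_2\boldsymbol{\beta}_2)\\
&= \boldsymbol{\beta}_1 + (\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{X}_2\boldsymbol{\beta}_2.
\end{align*}

(ii)
\begin{align*}
\operatorname{cov}(\hat{\boldsymbol{\beta}}_1^*) &= \operatorname{cov}[(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{y}]\\
&= (\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'(\sigma^2\mathbf{I})\mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1} \quad [\text{by (3.44)}]\\
&= \sigma^2(\mathbf{X}_1'\mathbf{X}_1)^{-1}.
\end{align*}

Thus, when underfitting, $\hat{\boldsymbol{\beta}}_1^*$ is biased by an amount that depends on the values of the $x$'s
in both $\mathbf{X}_1$ and $\mathbf{X}_2$. The matrix $\mathbf{A} = (\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{X}_2$ in (7.80) is called the alias matrix.


Corollary 1. If $\mathbf{X}_1'\mathbf{X}_2 = \mathbf{O}$, that is, if the columns of $\mathbf{X}_1$ are orthogonal to the
columns of $\mathbf{X}_2$, then $\hat{\boldsymbol{\beta}}_1^*$ is unbiased: $E(\hat{\boldsymbol{\beta}}_1^*) = \boldsymbol{\beta}_1$.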
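The bias formula in Theorem 7.9a(i) and its corollary can be illustrated without simulation by applying the estimator to $E(\mathbf{y})$ itself, since $E(\hat{\boldsymbol{\beta}}_1^*) = (\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'E(\mathbf{y})$. The sketch below assumes hypothetical design columns and coefficient values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 30
X1 = np.column_stack([np.ones(n), rng.normal(size=(n, 1))])   # hypothetical retained columns
X2 = rng.normal(size=(n, 2))                                  # hypothetical omitted columns
beta1 = np.array([1.0, 2.0])                                  # illustrative coefficients
beta2 = np.array([0.5, -1.5])

mu = X1 @ beta1 + X2 @ beta2                     # E(y) under the full model (7.78)
A = np.linalg.solve(X1.T @ X1, X1.T @ X2)        # alias matrix from (7.80)

expected_b1_star = np.linalg.solve(X1.T @ X1, X1.T @ mu)   # E(beta1_hat*)
bias = expected_b1_star - beta1                             # should equal A beta2

# Orthogonal case of Corollary 1: replace X2 by its residual after regression on X1.
X2_orth = X2 - X1 @ A
A_orth = np.linalg.solve(X1.T @ X1, X1.T @ X2_orth)         # alias matrix is now zero
```

With `X2_orth` in place of `X2`, the alias matrix vanishes and the reduced-model estimator is unbiased for $\boldsymbol{\beta}_1$, as the corollary states.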


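In the next three theorems, we discuss the effect of underfitting or overfitting on $\hat{y}$, $s^2$, and
the variances of the $\hat{\beta}_j$'s. In some of the proofs we follow Hocking (1996, pp. 245-247).
Let $\mathbf{x}_0 = (1, x_{01}, x_{02}, \ldots, x_{0k})'$ be a particular value of $\mathbf{x}$ for which we desire to
estimate $E(y_0) = \mathbf{x}_0'\boldsymbol{\beta}$. If we partition $\mathbf{x}_0'$ into $(\mathbf{x}_{01}', \mathbf{x}_{02}')$ corresponding to the
partitioning $\mathbf{X} = (\mathbf{X}_1, \mathbf{X}_2)$ and $\boldsymbol{\beta}' = (\boldsymbol{\beta}_1', \boldsymbol{\beta}_2')$, then we can use either $\hat{y}_0 = \mathbf{x}_0'\hat{\boldsymbol{\beta}}$ or
$\hat{y}_{01} = \mathbf{x}_{01}'\hat{\boldsymbol{\beta}}_1^*$ to estimate $\mathbf{x}_0'\boldsymbol{\beta}$. In the following theorem, we consider the mean of $\hat{y}_{01}$.


Theorem 7.9b. Let $\hat{y}_{01} = \mathbf{x}_{01}'\hat{\boldsymbol{\beta}}_1^*$, where $\hat{\boldsymbol{\beta}}_1^* = (\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{y}$. Then, if $\boldsymbol{\beta}_2 \neq \mathbf{0}$, we
obtain

\begin{align}
E(\mathbf{x}_{01}'\hat{\boldsymbol{\beta}}_1^*) &= \mathbf{x}_{01}'(\boldsymbol{\beta}_1 + \mathbf{A}\boldsymbol{\beta}_2) \tag{7.82}\\
&= \mathbf{x}_0'\boldsymbol{\beta} - (\mathbf{x}_{02} - \mathbf{A}'\mathbf{x}_{01})'\boldsymbol{\beta}_2 \neq \mathbf{x}_0'\boldsymbol{\beta}. \tag{7.83}
\end{align}

PROOF. See Problem 7.43.


In Theorem 7.9b, we see that, when underfitting, $\mathbf{x}_{01}'\hat{\boldsymbol{\beta}}_1^*$ is biased for estimating $\mathbf{x}_0'\boldsymbol{\beta}$.
[When overfitting, $\mathbf{x}_0'\hat{\boldsymbol{\beta}}$ is unbiased since $E(\mathbf{x}_0'\hat{\boldsymbol{\beta}}) = \mathbf{x}_0'\boldsymbol{\beta} = \mathbf{x}_{01}'\boldsymbol{\beta}_1 + \mathbf{x}_{02}'\boldsymbol{\beta}_2$, which is
equal to $\mathbf{x}_{01}'\boldsymbol{\beta}_1$ if $\boldsymbol{\beta}_2 = \mathbf{0}$.]
In the next theorem, we compare the variances of $\hat{\beta}_j$ and $\hat{\beta}_j^*$, where $\hat{\beta}_j$ is from $\hat{\boldsymbol{\beta}}_1$
and $\hat{\beta}_j^*$ is from $\hat{\boldsymbol{\beta}}_1^*$. We also compare the variances of $\mathbf{x}_{01}'\hat{\boldsymbol{\beta}}_1^*$ and $\mathbf{x}_0'\hat{\boldsymbol{\beta}}$.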

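Theorem 7.9c. Let $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$ from the full model be partitioned as $\hat{\boldsymbol{\beta}} = \begin{pmatrix} \hat{\boldsymbol{\beta}}_1 \\ \hat{\boldsymbol{\beta}}_2 \end{pmatrix}$, and let $\hat{\boldsymbol{\beta}}_1^* = (\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{y}$ be the estimator from the reduced model. Then

(i) $\operatorname{cov}(\hat{\boldsymbol{\beta}}_1) - \operatorname{cov}(\hat{\boldsymbol{\beta}}_1^*) = \sigma^2\mathbf{A}\mathbf{B}^{-1}\mathbf{A}'$, which is a positive definite matrix, where $\mathbf{A} = (\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{X}_2$ and $\mathbf{B} = \mathbf{X}_2'\mathbf{X}_2 - \mathbf{X}_2'\mathbf{X}_1\mathbf{A}$. Thus $\operatorname{var}(\hat{\beta}_j) > \operatorname{var}(\hat{\beta}_j^*)$.
(ii) $\operatorname{var}(\mathbf{x}_0'\hat{\boldsymbol{\beta}}) \geq \operatorname{var}(\mathbf{x}_{01}'\hat{\boldsymbol{\beta}}_1^*)$.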


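PROOF

(i) Using $\mathbf{X}'\mathbf{X}$ partitioned to conform to $\mathbf{X} = (\mathbf{X}_1, \mathbf{X}_2)$, we have

$$\operatorname{cov}(\hat{\boldsymbol{\beta}}) = \operatorname{cov}\begin{pmatrix} \hat{\boldsymbol{\beta}}_1 \\ \hat{\boldsymbol{\beta}}_2 \end{pmatrix} = \sigma^2(\mathbf{X}'\mathbf{X})^{-1} = \sigma^2\begin{pmatrix} \mathbf{X}_1'\mathbf{X}_1 & \mathbf{X}_1'\mathbf{X}_2 \\ \mathbf{X}_2'\mathbf{X}_1 & \mathbf{X}_2'\mathbf{X}_2 \end{pmatrix}^{-1} = \sigma^2\begin{pmatrix} \mathbf{G}_{11} & \mathbf{G}_{12} \\ \mathbf{G}_{21} & \mathbf{G}_{22} \end{pmatrix}^{-1} = \sigma^2\begin{pmatrix} \mathbf{G}^{11} & \mathbf{G}^{12} \\ \mathbf{G}^{21} & \mathbf{G}^{22} \end{pmatrix},$$

where $\mathbf{G}_{ij} = \mathbf{X}_i'\mathbf{X}_j$ and $\mathbf{G}^{ij}$ is the corresponding block of the partitioned inverse matrix $(\mathbf{X}'\mathbf{X})^{-1}$. Thus $\operatorname{cov}(\hat{\boldsymbol{\beta}}_1) = \sigma^2\mathbf{G}^{11}$. By (2.50), $\mathbf{G}^{11} = \mathbf{G}_{11}^{-1} + \mathbf{G}_{11}^{-1}\mathbf{G}_{12}\mathbf{B}^{-1}\mathbf{G}_{21}\mathbf{G}_{11}^{-1}$, where $\mathbf{B} = \mathbf{G}_{22} - \mathbf{G}_{21}\mathbf{G}_{11}^{-1}\mathbf{G}_{12}$. By (7.81), $\operatorname{cov}(\hat{\boldsymbol{\beta}}_1^*) = \sigma^2(\mathbf{X}_1'\mathbf{X}_1)^{-1} = \sigma^2\mathbf{G}_{11}^{-1}$. Hence

$$\operatorname{cov}(\hat{\boldsymbol{\beta}}_1) - \operatorname{cov}(\hat{\boldsymbol{\beta}}_1^*) = \sigma^2(\mathbf{G}^{11} - \mathbf{G}_{11}^{-1}) = \sigma^2(\mathbf{G}_{11}^{-1} + \mathbf{G}_{11}^{-1}\mathbf{G}_{12}\mathbf{B}^{-1}\mathbf{G}_{21}\mathbf{G}_{11}^{-1} - \mathbf{G}_{11}^{-1}) = \sigma^2\mathbf{A}\mathbf{B}^{-1}\mathbf{A}'.$$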


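(ii) $$\operatorname{var}(\mathbf{x}_0'\hat{\boldsymbol{\beta}}) = \sigma^2\mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0 = \sigma^2(\mathbf{x}_{01}', \mathbf{x}_{02}')\begin{pmatrix} \mathbf{G}^{11} & \mathbf{G}^{12} \\ \mathbf{G}^{21} & \mathbf{G}^{22} \end{pmatrix}\begin{pmatrix} \mathbf{x}_{01} \\ \mathbf{x}_{02} \end{pmatrix}$$

$$= \sigma^2(\mathbf{x}_{01}'\mathbf{G}^{11}\mathbf{x}_{01} + \mathbf{x}_{01}'\mathbf{G}^{12}\mathbf{x}_{02} + \mathbf{x}_{02}'\mathbf{G}^{21}\mathbf{x}_{01} + \mathbf{x}_{02}'\mathbf{G}^{22}\mathbf{x}_{02}).$$

Using (2.50), it can be shown that

$$\operatorname{var}(\mathbf{x}_0'\hat{\boldsymbol{\beta}}) - \operatorname{var}(\mathbf{x}_{01}'\hat{\boldsymbol{\beta}}_1^*) = \sigma^2(\mathbf{x}_{02} - \mathbf{A}'\mathbf{x}_{01})'\mathbf{G}^{22}(\mathbf{x}_{02} - \mathbf{A}'\mathbf{x}_{01}) \geq 0$$

because $\mathbf{G}^{22}$ is positive definite.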


By Theorem 7.9c(i), $\operatorname{var}(\hat{\beta}_j)$ in the full model is greater than $\operatorname{var}(\hat{\beta}_j^*)$ in the reduced model. Thus underfitting reduces the variance of the $\hat{\beta}_j$'s but introduces bias. On the other hand, overfitting increases the variance of the $\hat{\beta}_j$'s. In Theorem 7.9c(ii), $\operatorname{var}(\hat{y}_0)$ based on the full model is greater than $\operatorname{var}(\hat{y}_{01})$ based on the reduced model. Again, underfitting reduces the variance of the estimate of $E(y_0)$ but introduces bias. Overfitting increases the variance of the estimate of $E(y_0)$.
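The covariance identity in Theorem 7.9c(i) can be verified on a small numerical case. This is only a sketch, assuming NumPy; the random design matrices, the value of $\sigma^2$, and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma2 = 30, 2.0

# Hypothetical design: X1 has an intercept and one regressor; X2 has two
# regressors deliberately correlated with the regressor in X1.
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
X2 = rng.normal(size=(n, 2)) + 0.5 * X1[:, [1]]
X = np.hstack([X1, X2])

# Covariance of the beta_1 block in the full model and in the reduced model.
cov_full = sigma2 * np.linalg.inv(X.T @ X)
cov_b1_full = cov_full[:2, :2]
cov_b1_reduced = sigma2 * np.linalg.inv(X1.T @ X1)   # (7.81)

# Closed form from Theorem 7.9c(i): sigma^2 * A B^{-1} A'.
A = np.linalg.solve(X1.T @ X1, X1.T @ X2)
B = X2.T @ X2 - X2.T @ X1 @ A
diff_formula = sigma2 * A @ np.linalg.inv(B) @ A.T
diff_direct = cov_b1_full - cov_b1_reduced

# Eigenvalues of the (symmetrized) difference should be nonnegative.
eigs = np.linalg.eigvalsh((diff_direct + diff_direct.T) / 2)
```

The diagonal of `diff_direct` holds the increases in the individual coefficient variances caused by adding the columns of `X2`.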


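We now consider $s^2$ for the full model and for the reduced model. For the full model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} = \mathbf{X}_1\boldsymbol{\beta}_1 + \mathbf{X}_2\boldsymbol{\beta}_2 + \boldsymbol{\varepsilon}$, the sample variance $s^2$ is given by (7.23) as

$$s^2 = \frac{(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})}{n - k - 1}.$$

In Theorem 7.3f, we have $E(s^2) = \sigma^2$. The expected value of $s^2$ for the reduced model is given in the following theorem.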


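Theorem 7.9d. If $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ is the correct model, then for the reduced model $\mathbf{y} = \mathbf{X}_1\boldsymbol{\beta}_1^* + \boldsymbol{\varepsilon}^*$ (underfitting), where $\mathbf{X}_1$ is $n \times (p + 1)$ with $p < k$, the variance estimator

$$s_1^2 = \frac{(\mathbf{y} - \mathbf{X}_1\hat{\boldsymbol{\beta}}_1^*)'(\mathbf{y} - \mathbf{X}_1\hat{\boldsymbol{\beta}}_1^*)}{n - p - 1} \tag{7.84}$$

has expected value

$$E(s_1^2) = \sigma^2 + \frac{\boldsymbol{\beta}_2'\mathbf{X}_2'[\mathbf{I} - \mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1']\mathbf{X}_2\boldsymbol{\beta}_2}{n - p - 1}. \tag{7.85}$$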


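PROOF. We write the numerator of (7.84) as

$$\text{SSE}_1 = \mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}_1^{*\prime}\mathbf{X}_1'\mathbf{y} = \mathbf{y}'\mathbf{y} - \mathbf{y}'\mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{y} = \mathbf{y}'[\mathbf{I} - \mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1']\mathbf{y}.$$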




Figure 7.6 Straight-line fit to a curved pattern of points. 


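Since $E(\mathbf{y}) = \mathbf{X}\boldsymbol{\beta}$ by assumption, we have, by Theorem 5.2a,

$$E(\text{SSE}_1) = \operatorname{tr}\{[\mathbf{I} - \mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1']\sigma^2\mathbf{I}\} + \boldsymbol{\beta}'\mathbf{X}'[\mathbf{I} - \mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1']\mathbf{X}\boldsymbol{\beta}$$

$$= (n - p - 1)\sigma^2 + \boldsymbol{\beta}_2'\mathbf{X}_2'[\mathbf{I} - \mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1']\mathbf{X}_2\boldsymbol{\beta}_2$$

(see Problem 7.45).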


Since the quadratic form in (7.85) is positive semidefinite, $s_1^2$ is biased upward when underfitting (see Fig. 7.6). We can also examine (7.85) from the perspective of overfitting, in which case $\boldsymbol{\beta}_2 = \mathbf{0}$ and $s_1^2$ is unbiased.

To summarize the results in this section, underfitting leads to biased $\hat{\beta}_j$'s, biased $\hat{y}$'s, and biased $s^2$. Overfitting increases the variances of the $\hat{\beta}_j$'s and of the $\hat{y}$'s. We are thus compelled to seek an appropriate balance between a biased model and one with large variances. This is the task of the model builder and serves as motivation for seeking an optimum subset of $x$'s.
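As a sketch of how (7.85) behaves, the following evaluates $E(s_1^2)$ for a straight-line fit when the true mean is quadratic. It assumes NumPy; the design points, coefficients, and $\sigma^2$ are hypothetical values chosen for illustration. The general quadratic-form expectation $\operatorname{tr}(\mathbf{M}\sigma^2\mathbf{I}) + \boldsymbol{\mu}'\mathbf{M}\boldsymbol{\mu}$ is computed alongside (7.85) so the two can be compared.

```python
import numpy as np

n, sigma2 = 7, 1.5
x = np.arange(-3.0, 4.0)                    # hypothetical design points -3, ..., 3

X1 = np.column_stack([np.ones(n), x])       # reduced model: intercept and x
X2 = np.column_stack([x**2])                # omitted quadratic term
beta1 = np.array([1.0, 0.5])                # illustrative coefficients
beta2 = np.array([0.8])
p = X1.shape[1] - 1                         # number of x's in the reduced model

# M = I - X1 (X1'X1)^{-1} X1' is the residual projection for the reduced model.
M = np.eye(n) - X1 @ np.linalg.solve(X1.T @ X1, X1.T)
mu = X1 @ beta1 + X2 @ beta2                # E(y) under the true model

# General expectation of the quadratic form y'My.
e_sse_general = sigma2 * np.trace(M) + mu @ M @ mu

# Expected value of s_1^2 from (7.85).
e_s1_sq = sigma2 + (beta2 @ X2.T @ M @ X2 @ beta2) / (n - p - 1)
```

Setting `beta2` to zero (the overfitting perspective) makes `e_s1_sq` equal to `sigma2`.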


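Example 7.9a. Suppose that the model $y_i = \beta_0^* + \beta_1^*x_i + \varepsilon_i^*$ has been fitted when the true model is $y_i = \beta_0 + \beta_1x_i + \beta_2x_i^2 + \varepsilon_i$. (This situation is similar to that illustrated in Figure 6.2.) In this case, $\hat{\beta}_0^*$, $\hat{\beta}_1^*$, and $s_1^2$ would be biased by an amount dependent on the choice of the $x_i$'s [see (7.80) and (7.86)]. The error term $\varepsilon_i^*$ in the misspecified model $y_i = \beta_0^* + \beta_1^*x_i + \varepsilon_i^*$ does not have a mean of 0:

$$E(\varepsilon_i^*) = E(y_i - \beta_0^* - \beta_1^*x_i) = E(y_i) - \beta_0^* - \beta_1^*x_i = \beta_0 + \beta_1x_i + \beta_2x_i^2 - \beta_0^* - \beta_1^*x_i = \beta_0 - \beta_0^* + (\beta_1 - \beta_1^*)x_i + \beta_2x_i^2.$$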




Figure 7.7 No-intercept model fit to data from an intercept model. 


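Example 7.9b. Suppose that the true model is $y_i = \beta_0 + \beta_1x_i + \varepsilon_i$ and we fit the model $y_i = \beta_1^*x_i + \varepsilon_i^*$, as illustrated in Figure 7.7.
For the model $y_i = \beta_1^*x_i + \varepsilon_i^*$, the least-squares estimator is

$$\hat{\beta}_1^* = \frac{\sum_{i=1}^n x_iy_i}{\sum_{i=1}^n x_i^2} \tag{7.86}$$

(see Problem 7.46). Then, under the full model $y_i = \beta_0 + \beta_1x_i + \varepsilon_i$, we have

$$E(\hat{\beta}_1^*) = \frac{1}{\sum_i x_i^2}\sum_i x_iE(y_i) = \frac{1}{\sum_i x_i^2}\sum_i x_i(\beta_0 + \beta_1x_i) = \frac{1}{\sum_i x_i^2}\Bigl(\beta_0\sum_i x_i + \beta_1\sum_i x_i^2\Bigr) = \beta_0\frac{\sum_i x_i}{\sum_i x_i^2} + \beta_1. \tag{7.87}$$

Thus $\hat{\beta}_1^*$ is biased by an amount that depends on $\beta_0$ and the values of the $x$'s.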
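The bias in (7.87) is easy to evaluate directly. The following minimal sketch uses plain Python; the function names, the $x$ values, and the coefficients are hypothetical and serve only to illustrate (7.86) and (7.87).

```python
def no_intercept_slope(xs, ys):
    """Least-squares slope for the no-intercept model, as in (7.86)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def expected_no_intercept_slope(xs, beta0, beta1):
    """E(beta1_hat_star) from (7.87) when the true model has an intercept."""
    return beta0 * sum(xs) / sum(x * x for x in xs) + beta1


# Hypothetical x values and true coefficients.
xs = [1.0, 2.0, 3.0, 4.0]
beta0, beta1 = 2.0, 0.5

# Applying the estimator to the noise-free means E(y_i) gives its expectation.
mean_ys = [beta0 + beta1 * x for x in xs]
expected = expected_no_intercept_slope(xs, beta0, beta1)
bias = expected - beta1
```

When the $x$'s sum to zero, the bias term vanishes, which is the scalar version of Corollary 1 to Theorem 7.9a.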


7.10 ORTHOGONALIZATION 


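In Section 7.9, we discussed estimation of $\boldsymbol{\beta}_1^*$ in the model $\mathbf{y} = \mathbf{X}_1\boldsymbol{\beta}_1^* + \boldsymbol{\varepsilon}^*$ when the true model is $\mathbf{y} = \mathbf{X}_1\boldsymbol{\beta}_1 + \mathbf{X}_2\boldsymbol{\beta}_2 + \boldsymbol{\varepsilon}$. By Theorem 7.9a, $E(\hat{\boldsymbol{\beta}}_1^*) = \boldsymbol{\beta}_1 + (\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{X}_2\boldsymbol{\beta}_2$, so that estimation of $\boldsymbol{\beta}_1$ is affected by the presence of $\mathbf{X}_2$, unless $\mathbf{X}_1'\mathbf{X}_2 = \mathbf{O}$, in which case, $E(\hat{\boldsymbol{\beta}}_1^*) = \boldsymbol{\beta}_1$. In the following theorem, we show that if $\mathbf{X}_1'\mathbf{X}_2 = \mathbf{O}$, the estimators of $\boldsymbol{\beta}_1^*$ and $\boldsymbol{\beta}_1$ not only have the same expected value, but are exactly the same.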


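Theorem 7.10. If $\mathbf{X}_1'\mathbf{X}_2 = \mathbf{O}$, then the estimator of $\boldsymbol{\beta}_1$ in the full model $\mathbf{y} = \mathbf{X}_1\boldsymbol{\beta}_1 + \mathbf{X}_2\boldsymbol{\beta}_2 + \boldsymbol{\varepsilon}$ is the same as the estimator of $\boldsymbol{\beta}_1^*$ in the reduced model $\mathbf{y} = \mathbf{X}_1\boldsymbol{\beta}_1^* + \boldsymbol{\varepsilon}^*$.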


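PROOF. The least-squares estimator of $\boldsymbol{\beta}_1^*$ is $\hat{\boldsymbol{\beta}}_1^* = (\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{y}$. For the estimator of $\boldsymbol{\beta}_1$ in the full model, we partition $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$ to obtain

$$\begin{pmatrix} \hat{\boldsymbol{\beta}}_1 \\ \hat{\boldsymbol{\beta}}_2 \end{pmatrix} = \begin{pmatrix} \mathbf{X}_1'\mathbf{X}_1 & \mathbf{X}_1'\mathbf{X}_2 \\ \mathbf{X}_2'\mathbf{X}_1 & \mathbf{X}_2'\mathbf{X}_2 \end{pmatrix}^{-1}\begin{pmatrix} \mathbf{X}_1'\mathbf{y} \\ \mathbf{X}_2'\mathbf{y} \end{pmatrix}.$$

Using the notation in the proof of Theorem 7.9c, this becomes

$$\begin{pmatrix} \hat{\boldsymbol{\beta}}_1 \\ \hat{\boldsymbol{\beta}}_2 \end{pmatrix} = \begin{pmatrix} \mathbf{G}_{11} & \mathbf{G}_{12} \\ \mathbf{G}_{21} & \mathbf{G}_{22} \end{pmatrix}^{-1}\begin{pmatrix} \mathbf{X}_1'\mathbf{y} \\ \mathbf{X}_2'\mathbf{y} \end{pmatrix} = \begin{pmatrix} \mathbf{G}^{11} & \mathbf{G}^{12} \\ \mathbf{G}^{21} & \mathbf{G}^{22} \end{pmatrix}\begin{pmatrix} \mathbf{X}_1'\mathbf{y} \\ \mathbf{X}_2'\mathbf{y} \end{pmatrix},$$

$$\hat{\boldsymbol{\beta}}_1 = \mathbf{G}^{11}\mathbf{X}_1'\mathbf{y} + \mathbf{G}^{12}\mathbf{X}_2'\mathbf{y}.$$

By (2.50), we obtain

$$\hat{\boldsymbol{\beta}}_1 = (\mathbf{G}_{11}^{-1} + \mathbf{G}_{11}^{-1}\mathbf{G}_{12}\mathbf{B}^{-1}\mathbf{G}_{21}\mathbf{G}_{11}^{-1})\mathbf{X}_1'\mathbf{y} - \mathbf{G}_{11}^{-1}\mathbf{G}_{12}\mathbf{B}^{-1}\mathbf{X}_2'\mathbf{y},$$

where $\mathbf{B} = \mathbf{G}_{22} - \mathbf{G}_{21}\mathbf{G}_{11}^{-1}\mathbf{G}_{12}$. If $\mathbf{G}_{12} = \mathbf{X}_1'\mathbf{X}_2 = \mathbf{O}$, then $\hat{\boldsymbol{\beta}}_1$ reduces to

$$\hat{\boldsymbol{\beta}}_1 = \mathbf{G}_{11}^{-1}\mathbf{X}_1'\mathbf{y} = (\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{y},$$

which is the same as $\hat{\boldsymbol{\beta}}_1^*$.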
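Theorem 7.10 can be illustrated numerically. In this sketch (assuming NumPy), the matrices are hypothetical random data, and $\mathbf{X}_2$ is made orthogonal to $\mathbf{X}_1$ by removing its projection onto the columns of $\mathbf{X}_1$.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 12

# Hypothetical data: X1 has an intercept and one regressor.
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
Z = rng.normal(size=(n, 2))

# Construct X2 with X1'X2 = O by subtracting the projection of Z onto X1.
X2 = Z - X1 @ np.linalg.solve(X1.T @ X1, X1.T @ Z)
y = rng.normal(size=n)

# Full-model and reduced-model least-squares estimates.
X = np.hstack([X1, X2])
b_full = np.linalg.solve(X.T @ X, X.T @ y)
b1_reduced = np.linalg.solve(X1.T @ X1, X1.T @ y)
```

The first two elements of `b_full` coincide with `b1_reduced`, up to floating-point error, because the cross-product block of `X.T @ X` is zero.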


Note that Theorem 7.10 will also hold if $\mathbf{X}_1$ and $\mathbf{X}_2$ are "essentially orthogonal," that is, if the centered columns of $\mathbf{X}_1$ are orthogonal to the centered columns of $\mathbf{X}_2$.

In Theorem 7.9a, we discussed estimation of $\boldsymbol{\beta}_1^*$ in the presence of $\boldsymbol{\beta}_2$ when $\mathbf{X}_1'\mathbf{X}_2 \neq \mathbf{O}$. We now consider a process of orthogonalization to give additional insights into the meaning of partial regression coefficients.

In Example 7.2, we illustrated the change in the estimate of a regression coefficient when another $x$ was added to the model. We now use the same data to further examine this change. The prediction equation obtained in Example 7.2 was

$$\hat{y} = 5.3754 + 3.0118x_1 - 1.2855x_2, \tag{7.88}$$


and the negative partial regressions of $y$ on $x_2$ were shown in Figure 7.2. By means of orthogonalization, we can give additional meaning to the term $-1.2855x_2$. In order to add $x_2$ to the prediction equation containing only $x_1$, we need to determine how much variation in $y$ is due to $x_2$ after the effect of $x_1$ has been accounted for, and we must also correct for the relationship between $x_1$ and $x_2$. Our approach is to consider the relationship between the residual variation after regressing $y$ on $x_1$ and the residual variation after regressing $x_2$ on $x_1$. We follow a three-step process.


1. Regress $y$ on $x_1$, and calculate residuals [see (7.11)]. The prediction equation is

$$\hat{y} = 1.8585 + 1.3019x_1, \tag{7.89}$$

and the residuals $y_i - \hat{y}_i(x_1)$ are given in Table 7.2, where $\hat{y}_i(x_1)$ indicates that $\hat{y}$ is based on a regression of $y$ on $x_1$ as in (7.89).

2. Regress $x_2$ on $x_1$ and calculate residuals. The prediction equation is

$$\hat{x}_2 = 2.7358 + 1.3302x_1, \tag{7.90}$$

and the residuals $x_{2i} - \hat{x}_{2i}(x_1)$ are given in Table 7.2, where $\hat{x}_{2i}(x_1)$ indicates that $x_2$ has been regressed on $x_1$ as in (7.90).

3. Now regress $y - \hat{y}(x_1)$ on $x_2 - \hat{x}_2(x_1)$, which gives

$$\widehat{y - \hat{y}} = -1.2855(x_2 - \hat{x}_2). \tag{7.91}$$

There is no intercept in (7.91) because both sets of residuals have a mean of 0.


TABLE 7.2 Data from Table 7.1 and Residuals

  y    x1    x2    y - ŷ(x1)    x2 - x̂2(x1)
  2     0     2       0.1415       -0.7358
  3     2     6      -1.4623        0.6038
  2     2     7      -2.4623        1.6038
  7     2     5       2.5377       -0.3962
  6     4     9      -1.0660        0.9434
  8     4     8       0.9340       -0.0566
 10     4     7       2.9340       -1.0566
  7     6    10      -2.6698       -0.7170
  8     6    11      -1.6698        0.2830
 12     6     9       2.3302       -1.7170
 11     8    15      -1.2736        1.6226
 14     8    13       1.7264       -0.3774


In (7.91), we obtain a clearer insight into the meaning of the partial regression coefficient $-1.2855$ in (7.88). We are using the "unexplained" portion of $x_2$ (after $x_1$ is accounted for) to predict the "unexplained" portion of $y$ (after $x_1$ is accounted for).

Since $x_2 - \hat{x}_2(x_1)$ is orthogonal to $x_1$ [see Section 7.4.2, in particular (7.29)], fitting $y - \hat{y}(x_1)$ to $x_2 - \hat{x}_2(x_1)$ yields the same coefficient, $-1.2855$, as when fitting $y$ to $x_1$ and $x_2$ together. Thus $-1.2855$ represents the additional effect of $x_2$ beyond the effect of $x_1$ and also after taking into account the overlap between $x_1$ and $x_2$ in their effect on $y$. The orthogonality of $x_1$ and $x_2 - \hat{x}_2(x_1)$ makes this simplified breakdown of effects possible.

We can substitute $\hat{y}(x_1)$ and $\hat{x}_2(x_1)$ in (7.91) to obtain

$$\widehat{y - \hat{y}} = \hat{y}(x_1, x_2) - \hat{y}(x_1) = -1.2855[x_2 - \hat{x}_2(x_1)],$$

or

$$\hat{y} - (1.8585 + 1.3019x_1) = -1.2855[x_2 - (2.7358 + 1.3302x_1)], \tag{7.92}$$

which reduces to

$$\hat{y} = 5.3754 + 3.0118x_1 - 1.2855x_2, \tag{7.93}$$

the same as (7.88). If we regress $y$ (rather than $y - \hat{y}$) on $x_2 - \hat{x}_2(x_1)$, we will still obtain $-1.2855x_2$, but we will not have $5.3754 + 3.0118x_1$.
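The three-step process and the substitution in (7.92) can be reproduced with the $y$, $x_1$, and $x_2$ columns of Table 7.2. The following is a minimal plain-Python sketch; the helper `simple_fit` and the variable names are hypothetical, and the value 7 for $x_2$ in the third row is inferred from the residual 1.6038 in that row.

```python
def simple_fit(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    slope = sxy / sxx
    return ybar - slope * xbar, slope


# Data columns from Table 7.2 (originally Table 7.1).
y = [2, 3, 2, 7, 6, 8, 10, 7, 8, 12, 11, 14]
x1 = [0, 2, 2, 2, 4, 4, 4, 6, 6, 6, 8, 8]
x2 = [2, 6, 7, 5, 9, 8, 7, 10, 11, 9, 15, 13]

# Step 1: regress y on x1 and keep the residuals, as in (7.89).
a_y, b_y = simple_fit(x1, y)
res_y = [yi - (a_y + b_y * xi) for xi, yi in zip(x1, y)]

# Step 2: regress x2 on x1 and keep the residuals, as in (7.90).
a_2, b_2 = simple_fit(x1, x2)
res_2 = [wi - (a_2 + b_2 * xi) for xi, wi in zip(x1, x2)]

# Step 3: regress residuals on residuals, as in (7.91).
a_r, partial = simple_fit(res_2, res_y)

# Substituting back as in (7.92) recovers the full prediction equation (7.93).
intercept = a_y - partial * a_2
coef_x1 = b_y - partial * b_2
```

The fitted intercept `a_r` in step 3 is zero up to rounding, which is why (7.91) has no intercept.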

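The correlation between the residuals $y - \hat{y}(x_1)$ and $x_2 - \hat{x}_2(x_1)$ is the same as the (sample) partial correlation of $y$ and $x_2$ with $x_1$ held fixed:

$$r_{y2\cdot 1} = r_{y-\hat{y},\,x_2-\hat{x}_2}. \tag{7.94}$$

This is discussed further in Section 10.8.
We now consider the general case with full model

$$\mathbf{y} = \mathbf{X}_1\boldsymbol{\beta}_1 + \mathbf{X}_2\boldsymbol{\beta}_2 + \boldsymbol{\varepsilon}$$

and reduced model

$$\mathbf{y} = \mathbf{X}_1\boldsymbol{\beta}_1^* + \boldsymbol{\varepsilon}^*.$$

We use an orthogonalization approach to obtain an estimator of $\boldsymbol{\beta}_2$, following the same three steps as in the illustration with $x_1$ and $x_2$ above: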


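1. Regress $\mathbf{y}$ on $\mathbf{X}_1$ and calculate residuals $\mathbf{y} - \hat{\mathbf{y}}(\mathbf{X}_1)$, where $\hat{\mathbf{y}}(\mathbf{X}_1) = \mathbf{X}_1\hat{\boldsymbol{\beta}}_1^* = \mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{y}$ [see (7.11)].

2. Regress the columns of $\mathbf{X}_2$ on $\mathbf{X}_1$ and obtain residuals $\mathbf{X}_{2\cdot 1} = \mathbf{X}_2 - \hat{\mathbf{X}}_2(\mathbf{X}_1)$. If $\mathbf{X}_2$ is written in terms of its columns as $\mathbf{X}_2 = (\mathbf{x}_{21}, \ldots, \mathbf{x}_{2j}, \ldots, \mathbf{x}_{2p})$, then the regression coefficient vector for $\mathbf{x}_{2j}$ on $\mathbf{X}_1$ is $\mathbf{b}_j = (\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{x}_{2j}$, and $\hat{\mathbf{x}}_{2j} = \mathbf{X}_1\mathbf{b}_j = \mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{x}_{2j}$. For all columns of $\mathbf{X}_2$, this becomes $\hat{\mathbf{X}}_2(\mathbf{X}_1) = \mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{X}_2 = \mathbf{X}_1\mathbf{A}$, where $\mathbf{A} = (\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{X}_2$ is the alias matrix defined in (7.80). Note that $\mathbf{X}_{2\cdot 1} = \mathbf{X}_2 - \hat{\mathbf{X}}_2(\mathbf{X}_1)$ is orthogonal to $\mathbf{X}_1$:

$$\mathbf{X}_1'\mathbf{X}_{2\cdot 1} = \mathbf{O}. \tag{7.95}$$

Using the alias matrix $\mathbf{A}$, the residual matrix can be expressed as

$$\mathbf{X}_{2\cdot 1} = \mathbf{X}_2 - \hat{\mathbf{X}}_2(\mathbf{X}_1) \tag{7.96}$$

$$= \mathbf{X}_2 - \mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{X}_2 = \mathbf{X}_2 - \mathbf{X}_1\mathbf{A}. \tag{7.97}$$

3. Regress $\mathbf{y} - \hat{\mathbf{y}}(\mathbf{X}_1)$ on $\mathbf{X}_{2\cdot 1} = \mathbf{X}_2 - \hat{\mathbf{X}}_2(\mathbf{X}_1)$. Since $\mathbf{X}_{2\cdot 1}$ is orthogonal to $\mathbf{X}_1$, we obtain the same $\hat{\boldsymbol{\beta}}_2$ as in the full model $\hat{\mathbf{y}} = \mathbf{X}_1\hat{\boldsymbol{\beta}}_1 + \mathbf{X}_2\hat{\boldsymbol{\beta}}_2$. Adapting the notation of (7.91) and (7.92), this can be expressed as

$$\hat{\mathbf{y}}(\mathbf{X}_1, \mathbf{X}_2) - \hat{\mathbf{y}}(\mathbf{X}_1) = \mathbf{X}_{2\cdot 1}\hat{\boldsymbol{\beta}}_2. \tag{7.98}$$

If we substitute $\hat{\mathbf{y}}(\mathbf{X}_1) = \mathbf{X}_1\hat{\boldsymbol{\beta}}_1^*$ and $\mathbf{X}_{2\cdot 1} = \mathbf{X}_2 - \mathbf{X}_1\mathbf{A}$ into (7.98) and use $\hat{\boldsymbol{\beta}}_1^* = \hat{\boldsymbol{\beta}}_1 + \mathbf{A}\hat{\boldsymbol{\beta}}_2$ from (7.80), we obtain

$$\hat{\mathbf{y}}(\mathbf{X}_1, \mathbf{X}_2) = \mathbf{X}_1\hat{\boldsymbol{\beta}}_1^* + (\mathbf{X}_2 - \mathbf{X}_1\mathbf{A})\hat{\boldsymbol{\beta}}_2 = \mathbf{X}_1(\hat{\boldsymbol{\beta}}_1 + \mathbf{A}\hat{\boldsymbol{\beta}}_2) + (\mathbf{X}_2 - \mathbf{X}_1\mathbf{A})\hat{\boldsymbol{\beta}}_2 = \mathbf{X}_1\hat{\boldsymbol{\beta}}_1 + \mathbf{X}_2\hat{\boldsymbol{\beta}}_2,$$

which is analogous to (7.93). This confirms that the orthogonality of $\mathbf{X}_1$ and $\mathbf{X}_{2\cdot 1}$ leads to the estimator $\hat{\boldsymbol{\beta}}_2$ in (7.98). For a formal proof, see Problem 7.50.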
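The general three-step orthogonalization can be sketched in matrix form as follows. This assumes NumPy; the random data, the dimensions, and the helper `project` are hypothetical and are used only to check (7.95), (7.98), and the sample identity $\hat{\boldsymbol{\beta}}_1^* = \hat{\boldsymbol{\beta}}_1 + \mathbf{A}\hat{\boldsymbol{\beta}}_2$.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 25

# Hypothetical data: X1 has an intercept and two regressors; X2 has two
# regressors correlated with the first regressor in X1.
X1 = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
X2 = rng.normal(size=(n, 2)) + X1[:, [1]]
y = rng.normal(size=n)


def project(X, v):
    """Fitted values from regressing v on the columns of X."""
    return X @ np.linalg.solve(X.T @ X, X.T @ v)


# Step 1: residuals of y after regression on X1.
res_y = y - project(X1, y)

# Step 2: residual matrix X2.1 = X2 - X1 A, as in (7.97).
A = np.linalg.solve(X1.T @ X1, X1.T @ X2)
X21 = X2 - X1 @ A

# Step 3: regress the y residuals on X2.1.
b2_orth = np.linalg.solve(X21.T @ X21, X21.T @ res_y)

# Full-model and reduced-model fits for comparison.
X = np.hstack([X1, X2])
b_full = np.linalg.solve(X.T @ X, X.T @ y)
b1_star = np.linalg.solve(X1.T @ X1, X1.T @ y)
```

Here `b_full[3:]` is the full-model estimate of $\boldsymbol{\beta}_2$, which step 3 reproduces without fitting the full model.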


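PROBLEMS

7.1 Show that $\sum_{i=1}^n (y_i - \mathbf{x}_i'\hat{\boldsymbol{\beta}})^2 = (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})$, thus verifying (7.7).

7.2 Show that (7.10) follows from (7.9). Why is $\mathbf{X}'\mathbf{X}$ positive definite, as noted below (7.10)?

7.3 Show that $\hat{\beta}_0$ and $\hat{\beta}_1$ in (7.12) in Example 7.3.1 are the same as in (6.5) and (6.6).

7.4 Obtain $\operatorname{cov}(\hat{\boldsymbol{\beta}})$ in (7.16) from (7.15).

7.5 Show that $\operatorname{var}(\hat{\beta}_0) = \sigma^2\bigl(\sum_i x_i^2/n\bigr)/\sum_i (x_i - \bar{x})^2$ in (7.16) in Example 7.3.2a is the same as $\operatorname{var}(\hat{\beta}_0)$ in (6.10).

7.6 Show that $\mathbf{A}\mathbf{A}'$ can be expressed as $\mathbf{A}\mathbf{A}' = [\mathbf{A} - (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'][\mathbf{A} - (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']' + (\mathbf{X}'\mathbf{X})^{-1}$ as in (7.17) in Theorem 7.3d.

7.7 Prove Corollary 1 to Theorem 7.3d in the following two ways:
(a) Use an approach similar to the proof of Theorem 7.3d.
(b) Use the method of Lagrange multipliers (Section 2.14.3).

7.8 Show that if the $x$'s are rescaled as $z_j = c_jx_j$, $j = 1, 2, \ldots, k$, then $\hat{\boldsymbol{\beta}}_z = \mathbf{D}^{-1}\hat{\boldsymbol{\beta}}$, as in (7.18) in the proof of Theorem 7.3e.

7.9 Verify (7.20) and (7.21) in the proof of Corollary 1 to Theorem 7.3e.

7.10 Show that $s^2$ is invariant to changes of scale on the $x$'s, as noted following Corollary 1 to Theorem 7.3e.

7.11 Show that $(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = \mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y}$ as in (7.24).

7.12 Show that $E(\text{SSE}) = \sigma^2(n - k - 1)$, as in Theorem 7.3f, using the following approach. Show that $\text{SSE} = \mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}}$. Show that $E(\mathbf{y}'\mathbf{y}) = n\sigma^2 + \boldsymbol{\beta}'\mathbf{X}'\mathbf{X}\boldsymbol{\beta}$ and that $E(\hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}}) = (k + 1)\sigma^2 + \boldsymbol{\beta}'\mathbf{X}'\mathbf{X}\boldsymbol{\beta}$.

7.13 Prove that an additional $x$ reduces SSE, as noted following Theorem 7.3f.

7.14 Show that the noncentered model preceding (7.30) can be written in the centered form in (7.30), with $\alpha$ defined as in (7.31).

7.15 Show that $\mathbf{X}_c = [\mathbf{I} - (1/n)\mathbf{J}]\mathbf{X}_1$ as in (7.33), where $\mathbf{X}_1$ is as given in (7.19).

7.16 Show that $\mathbf{j}'\mathbf{X}_c = \mathbf{0}'$, as in (7.35), where $\mathbf{X}_c$ is the centered $\mathbf{X}$ matrix defined in (7.33).

7.17 Show that the estimators $\hat{\alpha} = \bar{y}$ and $\hat{\boldsymbol{\beta}}_1 = (\mathbf{X}_c'\mathbf{X}_c)^{-1}\mathbf{X}_c'\mathbf{y}$ in (7.36) and (7.37) are the same as $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$ in (7.6). Use the following two methods:
(a) Work with the normal equations in both cases.
(b) Use the inverse of $\mathbf{X}'\mathbf{X}$ in partitioned form: $(\mathbf{X}'\mathbf{X})^{-1} = [(\mathbf{j}, \mathbf{X}_1)'(\mathbf{j}, \mathbf{X}_1)]^{-1}$.

7.18 Show that the fitted regression plane $\hat{y} = \hat{\alpha} + \hat{\beta}_1(x_1 - \bar{x}_1) + \cdots + \hat{\beta}_k(x_k - \bar{x}_k)$ passes through the point $(\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_k, \bar{y})$, as noted below (7.38).

7.19 Show that $\text{SSE} = \sum_i (y_i - \bar{y})^2 - \hat{\boldsymbol{\beta}}_1'\mathbf{X}_c'\mathbf{y}$ in (7.39) is the same as $\text{SSE} = \mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y}$ in (7.24).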

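7.20 (a) Show that $\mathbf{S}_{xx} = \mathbf{X}_c'\mathbf{X}_c/(n - 1)$ as in (7.44).
(b) Show that $\mathbf{s}_{yx} = \mathbf{X}_c'\mathbf{y}/(n - 1)$ as in (7.45).

7.21 (a) Show that if $\mathbf{y}$ is $N_n(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{I})$, the likelihood function is

$$L(\boldsymbol{\beta}, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}}e^{-(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})/2\sigma^2},$$

as in (7.50) in the proof of Theorem 7.6a.
(b) Differentiate $\ln L(\boldsymbol{\beta}, \sigma^2)$ in (7.51) with respect to $\boldsymbol{\beta}$ to obtain $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$ in (7.48).
(c) Differentiate $\ln L(\boldsymbol{\beta}, \sigma^2)$ with respect to $\sigma^2$ to obtain $\hat{\sigma}^2 = (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})/n$ as in (7.49).

7.22 Prove parts (ii) and (iii) of Theorem 7.6b.

7.23 Show that $(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) = (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) + (\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})'\mathbf{X}'\mathbf{X}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})$ as in (7.52) in the proof of Theorem 7.6c.

7.24 Explain why $f(\mathbf{y}; \boldsymbol{\beta}, \sigma^2)$ does not factor into $g_1(\hat{\boldsymbol{\beta}}, \boldsymbol{\beta})g_2(\hat{\sigma}^2, \sigma^2)h(\mathbf{y})$, as noted following Theorem 7.6c.

7.25 Verify the equivalence of (7.55) and (7.56); that is, show that $\hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} - n\bar{y}^2 = \hat{\boldsymbol{\beta}}_1'\mathbf{X}_c'\mathbf{X}_c\hat{\boldsymbol{\beta}}_1$.

7.26 Verify the comments in property 1 in Section 7.7, namely, that if $\hat{\beta}_1 = \hat{\beta}_2 = \cdots = \hat{\beta}_k = 0$, then $R^2 = 0$, and if $y_i = \hat{y}_i$, $i = 1, 2, \ldots, n$, then $R^2 = 1$.

7.27 Show that adding an $x$ to the model increases (cannot decrease) the value of $R^2$, as in property 3 in Section 7.7.

7.28 (a) Verify that $R^2$ is invariant to full-rank linear transformations on the $x$'s as in property 6 in Section 7.7.
(b) Show that $R^2$ is invariant to a scale change $z = cy$ on $y$.

7.29 (a) Show that $R^2$ in (7.55) can be written in the form $R^2 = 1 - \text{SSE}/\sum_i (y_i - \bar{y})^2$.
(b) Replace SSE and $\sum_i (y_i - \bar{y})^2$ in part (a) by variance estimators $\text{SSE}/(n - k - 1)$ and $\sum_i (y_i - \bar{y})^2/(n - 1)$ and show that the result is the same as $R_a^2$ in (7.56).

7.30 Show that $\sum_{i=1}^n \hat{y}_i/n = \sum_{i=1}^n y_i/n$, as noted following (7.59) in Section 7.7.

7.31 Show that $\cos\theta = R$ as in (7.61), where $R^2$ is as given by (7.56).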


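7.32 (a) Show that $E(\hat{\boldsymbol{\beta}}) = \boldsymbol{\beta}$, where $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{-1}\mathbf{y}$ as in (7.63).
(b) Show that $\operatorname{cov}(\hat{\boldsymbol{\beta}}) = \sigma^2(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}$ as in (7.64).

7.33 (a) Show that the two forms of $s^2$ in (7.65) and (7.66) are equal.
(b) Show that $E(s^2) = \sigma^2$, where $s^2$ is as given by (7.66).

7.34 Complete the steps in the proof of Theorem 7.8b.

7.35 Show that for $\mathbf{V} = (1 - \rho)\mathbf{I} + \rho\mathbf{J}$ in (7.67), the inverse is given by $\mathbf{V}^{-1} = a(\mathbf{I} - b\rho\mathbf{J})$ as in (7.68), where $a = 1/(1 - \rho)$ and $b = 1/[1 + (n - 1)\rho]$.

7.36 (a) Show that $\mathbf{X}'\mathbf{V}^{-1}\mathbf{X} = \begin{pmatrix} bn & \mathbf{0}' \\ \mathbf{0} & a\mathbf{X}_c'\mathbf{X}_c \end{pmatrix}$ as in (7.69).
(b) Show that $\mathbf{X}'\mathbf{V}^{-1}\mathbf{y} = \begin{pmatrix} bn\bar{y} \\ a\mathbf{X}_c'\mathbf{y} \end{pmatrix}$ as in (7.70).

7.37 Show that $\operatorname{cov}(\hat{\boldsymbol{\beta}}^*) = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}$ as in (7.72), where $\hat{\boldsymbol{\beta}}^* = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$ and $\operatorname{cov}(\mathbf{y}) = \sigma^2\mathbf{V}$.

7.38 (a) Show that the weighted least-squares estimator $\hat{\boldsymbol{\beta}} = (\hat{\beta}_0, \hat{\beta}_1)'$ for the model $y_i = \beta_0 + \beta_1x_i + \varepsilon_i$ with $\operatorname{var}(y_i) = \sigma^2x_i$ has the form given in (7.73).
(b) Verify the expression for $\operatorname{cov}(\hat{\boldsymbol{\beta}})$ in (7.74).

7.39 Obtain the expression for $\operatorname{cov}(\hat{\boldsymbol{\beta}}^*)$ in (7.75).

7.40 As an alternative derivation of $\operatorname{var}(\hat{\beta}_1^*)$ in (7.76), use the following two steps to find $\operatorname{var}(\hat{\beta}_1^*)$ using $\hat{\beta}_1^* = \sum_i (x_i - \bar{x})y_i/\sum_i (x_i - \bar{x})^2$ from the answer to Problem 6.2:
(a) Using $\operatorname{var}(y_i) = \sigma^2x_i$, show that $\operatorname{var}(\hat{\beta}_1^*) = \sigma^2\sum_i (x_i - \bar{x})^2x_i/\bigl[\sum_i (x_i - \bar{x})^2\bigr]^2$.
(b) Show that this expression for $\operatorname{var}(\hat{\beta}_1^*)$ is equal to that in (7.76).

7.41 Using $x = 2, 3, 5, 7, 8, 10$, compare $\operatorname{var}(\hat{\beta}_1^*)$ in (7.76) with $\operatorname{var}(\hat{\beta}_1)$ in (7.77).

7.42 Provide an alternative proof of $\operatorname{cov}(\hat{\boldsymbol{\beta}}_1^*) = \sigma^2(\mathbf{X}_1'\mathbf{X}_1)^{-1}$ in (7.81) using the definition in (3.24), $\operatorname{cov}(\hat{\boldsymbol{\beta}}_1^*) = E\{[\hat{\boldsymbol{\beta}}_1^* - E(\hat{\boldsymbol{\beta}}_1^*)][\hat{\boldsymbol{\beta}}_1^* - E(\hat{\boldsymbol{\beta}}_1^*)]'\}$.

7.43 Prove Theorem 7.9b.

7.44 Provide the missing steps in the proof of Theorem 7.9c(ii).

7.45 Show that $\mathbf{x}_{01}'\hat{\boldsymbol{\beta}}_1^*$ is biased for estimating $\mathbf{x}_{01}'\boldsymbol{\beta}_1$ if $\boldsymbol{\beta}_2 \neq \mathbf{0}$ and $\mathbf{X}_1'\mathbf{X}_2 \neq \mathbf{O}$.

7.46 Show that $\operatorname{var}(\mathbf{x}_{01}'\hat{\boldsymbol{\beta}}_1) \geq \operatorname{var}(\mathbf{x}_{01}'\hat{\boldsymbol{\beta}}_1^*)$.

7.47 Complete the steps in the proof of Theorem 7.9d.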


7.48 Show that for the no-intercept model $y_i = \beta_1^*x_i + \varepsilon_i^*$, the least-squares estimator is $\hat{\beta}_1^* = \sum_{i=1}^n x_iy_i/\sum_{i=1}^n x_i^2$ as in (7.86).

7.49 Obtain $E(\hat{\beta}_1^*) = \beta_0\sum_i x_i/\sum_i x_i^2 + \beta_1$ in (7.87) using (7.80), $E(\hat{\boldsymbol{\beta}}_1^*) = \boldsymbol{\beta}_1 + \mathbf{A}\boldsymbol{\beta}_2$.

7.50 Suppose that we use the model $y_i = \beta_0^* + \beta_1^*x_i + \varepsilon_i^*$ when the true model is $y_i = \beta_0 + \beta_1x_i + \beta_2x_i^2 + \beta_3x_i^3 + \varepsilon_i$.
(a) Using (7.80), find $E(\hat{\beta}_0^*)$ and $E(\hat{\beta}_1^*)$ if observations are taken at $x = -3, -2, -1, 0, 1, 2, 3$.
(b) Using (7.85), find $E(s_1^2)$ for the same values of $x$.

7.51 Show that $\mathbf{X}_{2\cdot 1} = \mathbf{X}_2 - \hat{\mathbf{X}}_2(\mathbf{X}_1)$ is orthogonal to $\mathbf{X}_1$, that is, $\mathbf{X}_1'\mathbf{X}_{2\cdot 1} = \mathbf{O}$, as in (7.95).

7.52 Show that $\hat{\boldsymbol{\beta}}_2$ in (7.98) is the same as in the full fitted model $\hat{\mathbf{y}} = \mathbf{X}_1\hat{\boldsymbol{\beta}}_1 + \mathbf{X}_2\hat{\boldsymbol{\beta}}_2$.

7.53 When gasoline is pumped into the tank of a car, vapors are vented into the atmosphere. An experiment was conducted to determine whether $y$, the amount of vapor, can be predicted using the following four variables based on initial conditions of the tank and the dispensed gasoline:

$x_1$ = tank temperature (°F)
$x_2$ = gasoline temperature (°F)
$x_3$ = vapor pressure in tank (psi)
$x_4$ = vapor pressure of gasoline (psi)

The data are given in Table 7.3 (Weisberg 1985, p. 138).
(a) Find $\hat{\boldsymbol{\beta}}$ and $s^2$.
(b) Find an estimate of $\operatorname{cov}(\hat{\boldsymbol{\beta}})$.
(c) Find $\hat{\boldsymbol{\beta}}_1$ and $\hat{\beta}_0$ using $\mathbf{S}_{xx}$ and $\mathbf{s}_{yx}$ as in (7.46) and (7.47).
(d) Find $R^2$ and $R_a^2$.

7.54 In an effort to obtain maximum yield in a chemical reaction, the values of the following variables were chosen by the experimenter:

$x_1$ = temperature (°C)
$x_2$ = concentration of a reagent (%)
$x_3$ = time of reaction (hours)

Two different response variables were observed:

$y_1$ = percent of unchanged starting material
$y_2$ = percent converted to the desired product


TABLE 7.3 Gas Vapor Data

  y   x1   x2    x3    x4      y   x1   x2    x3    x4
 29   33   53  3.32  3.42     40   90   64  7.32  6.70
 24   31   36  3.10  3.26     46   90   60  7.32  7.20
 26   33   51  3.18  3.18     55   92   92  7.45  7.45
 22   37   51  3.39  3.08     52   91   92  7.27  7.26
 27   36   54  3.20  3.41     29   61   62  3.91  4.08
 21   35   35  3.03  3.03     22   59   42  3.75  3.45
 33   59   56  4.78  4.57     31   88   65  6.48  5.80
 34   60   60  4.72  4.72     45   91   89  6.70  6.60
 32   59   60  4.60  4.41     37   63   62  4.30  4.30
 34   60   60  4.53  4.53     37   60   61  4.02  4.10
 20   34   35  2.90  2.95     33   60   62  4.02  3.89
 36   60   59  4.40  4.36     27   59   62  3.98  4.02
 34   60   62  4.31  4.42     34   59   62  4.39  4.53
 23   60   36  4.27  3.94     19   37   35  2.75  2.64
 24   62   38  4.41  3.49     16   35   35  2.59  2.59
 32   62   61  4.39  4.39     22   37   37  2.73  2.59


The data are listed in Table 7.4 (Box and Youle 1955, Andrews and Herzberg 1985, p. 188). Carry out the following for $y_1$:

(a) Find $\hat{\boldsymbol{\beta}}$ and $s^2$.
(b) Find an estimate of $\operatorname{cov}(\hat{\boldsymbol{\beta}})$.


TABLE 7.4 Chemical Reaction Data

   y1     y2    x1     x2    x3
 41.5   45.9   162   23.0   3.0
 33.8   53.3   162   23.0   8.0
 27.7   57.5   162   30.0   5.0
 21.7   58.8   162   30.0   8.0
 19.9   60.6   172   25.0   5.0
 15.0   58.0   172   25.0   8.0
 12.2   58.6   172   30.0   5.0
  4.3   52.4   172   30.0   8.0
 19.3   56.9   167   27.5   6.5
  6.4   55.4   177   27.5   6.5
 37.6   46.9   157   27.5   6.5
 18.0   57.3   167   32.5   6.5
 26.3   55.0   167   22.5   6.5
  9.9   58.9   167   27.5   9.5
 25.0   50.3   167   27.5   3.5
 14.1   61.1   177   20.0   6.5
 15.2   62.9   177   20.0   6.5
 15.9   60.0   160   34.0   7.5
 19.6   60.6   160   34.0   7.5


TABLE 7.5 Land Rent Data

     y      x1      x2    x3         y      x1      x2    x3
 18.38   15.50   17.25   .24      8.50    9.00    8.89   .08
 20.00   22.29   18.51   .20     36.50   20.64   23.81   .24
 11.50   12.36   11.13   .12     60.00   81.40    4.54   .05
 25.00   31.84    5.54   .12     16.25   18.92   29.62   .72
 52.50   83.90    5.44   .04     50.00   50.32   21.36   .19
 82.50   72.25   20.37   .05     11.50   21.33    1.53   .10
 25.00   27.14   31.20   .27     35.00   46.85    5.42   .08
 30.67   40.41    4.29   .10     75.00   65.94   22.10   .09
 12.00   12.42    8.69   .11     31.56   38.68   14.55   .17
 61.25   69.42    6.63   .04     48.50   51.19    7.59   .13
 60.00   48.46   27.40   .12     77.50   59.42   49.86   .13
 57.50   69.00   31.23   .08     21.67   24.64   11.46   .21
 31.00   26.09   28.50   .21     19.75   26.94    2.48   .10
 60.00   62.83   29.98   .17     56.00   46.20   31.62   .26
 72.50   77.06   13.59   .05     25.00   26.86   53.73   .43
 60.33   58.83   45.46   .16     40.00   20.00   40.18   .56
 49.75   59.48   35.90   .32     56.67   62.52   15.89   .05
(c) Find $R^2$ and $R_a^2$.
(d) In order to find the maximum yield for $y_1$, a second-order model is of interest. Find $\hat{\boldsymbol{\beta}}$ and $s^2$ for the model $y_1 = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_3 + \beta_4x_1^2 + \beta_5x_2^2 + \beta_6x_3^2 + \beta_7x_1x_2 + \beta_8x_1x_3 + \beta_9x_2x_3 + \varepsilon$.
(e) Find $R^2$ and $R_a^2$ for the second-order model.

7.55 The following variables were recorded for several counties in Minnesota in 1977:

$y$ = average rent paid per acre of land with alfalfa
$x_1$ = average rent paid per acre for all land
$x_2$ = average number of dairy cows per square mile
$x_3$ = proportion of farmland in pasture

The data for 34 counties are given in Table 7.5 (Weisberg 1985, p. 162). Can rent for alfalfa land be predicted from the other three variables?

(a) Find $\hat{\boldsymbol{\beta}}$ and $s^2$.
(b) Find $\hat{\boldsymbol{\beta}}_1$ and $\hat{\beta}_0$ using $\mathbf{S}_{xx}$ and $\mathbf{s}_{yx}$ as in (7.46) and (7.47).
(c) Find $R^2$ and $R_a^2$.


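8 Multiple Regression: Tests of Hypotheses and Confidence Intervals


In this chapter we consider hypothesis tests and confidence intervals for the parameters $\beta_0, \beta_1, \ldots, \beta_k$ in $\boldsymbol{\beta}$ in the model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$. We also provide a confidence interval for $\sigma^2 = \operatorname{var}(y_i)$. We will assume throughout the chapter that $\mathbf{y}$ is $N_n(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{I})$, where $\mathbf{X}$ is $n \times (k + 1)$ of rank $k + 1 < n$.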


8.1 TEST OF OVERALL REGRESSION 


We noted in Section 7.9 that the problems associated with both overfitting and underfitting motivate us to seek an optimal model. Hypothesis testing is a formal tool for, among other things, choosing between a reduced model and an associated full model. The hypothesis $H_0$ expresses the reduced model in terms of values of a subset of the $\beta_j$'s in $\boldsymbol{\beta}$. The alternative hypothesis, $H_1$, is associated with the full model.

To illustrate this tool we begin with a common test, the test of the overall regression hypothesis that none of the $x$ variables predict $y$. This hypothesis (leading to the reduced model) can be expressed as $H_0\colon \boldsymbol{\beta}_1 = \mathbf{0}$, where $\boldsymbol{\beta}_1 = (\beta_1, \beta_2, \ldots, \beta_k)'$. Note that we wish to test $H_0\colon \boldsymbol{\beta}_1 = \mathbf{0}$, not $H_0\colon \boldsymbol{\beta} = \mathbf{0}$, where

$$\boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \boldsymbol{\beta}_1 \end{pmatrix}.$$

Since $\beta_0$ is usually not zero, we would rarely be interested in including $\beta_0 = 0$ in the hypothesis. Rejection of $H_0\colon \boldsymbol{\beta} = \mathbf{0}$ might be due solely to $\beta_0$, and we would not learn whether the $x$ variables predict $y$. For a test of $H_0\colon \boldsymbol{\beta} = \mathbf{0}$, see Problem 8.6.
We proceed by proposing a test statistic that is distributed as a central $F$ if $H_0$ is true and as a noncentral $F$ otherwise. Our approach to obtaining a test statistic is somewhat



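simplified if we use the centered model (7.32)

$$\mathbf{y} = (\mathbf{j}, \mathbf{X}_c)\begin{pmatrix} \alpha \\ \boldsymbol{\beta}_1 \end{pmatrix} + \boldsymbol{\varepsilon},$$

where $\mathbf{X}_c = [\mathbf{I} - (1/n)\mathbf{J}]\mathbf{X}_1$ is the centered matrix [see (7.33)] and $\mathbf{X}_1$ contains all the columns of $\mathbf{X}$ except the first [see (7.19)]. The corrected total sum of squares $\text{SST} = \sum_{i=1}^n (y_i - \bar{y})^2$ can be partitioned as

$$\sum_{i=1}^n (y_i - \bar{y})^2 = \hat{\boldsymbol{\beta}}_1'\mathbf{X}_c'\mathbf{y} + \Bigl[\sum_{i=1}^n (y_i - \bar{y})^2 - \hat{\boldsymbol{\beta}}_1'\mathbf{X}_c'\mathbf{y}\Bigr] \quad [\text{by (7.53)}]$$

$$= \hat{\boldsymbol{\beta}}_1'\mathbf{X}_c'\mathbf{X}_c\hat{\boldsymbol{\beta}}_1 + \text{SSE} = \text{SSR} + \text{SSE} \quad [\text{by (7.54)}], \tag{8.1}$$

where SSE is as given in (7.39). The regression sum of squares $\text{SSR} = \hat{\boldsymbol{\beta}}_1'\mathbf{X}_c'\mathbf{X}_c\hat{\boldsymbol{\beta}}_1$ is clearly due to $\boldsymbol{\beta}_1$.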

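In order to construct an $F$ test, we first express the sums of squares in (8.1) as quadratic forms in $\mathbf{y}$ so that we can use theorems from Chapter 5 to show that SSR and SSE have chi-square distributions and are independent. Using $\sum_i (y_i - \bar{y})^2 = \mathbf{y}'[\mathbf{I} - (1/n)\mathbf{J}]\mathbf{y}$ in (5.2), $\hat{\boldsymbol{\beta}}_1 = (\mathbf{X}_c'\mathbf{X}_c)^{-1}\mathbf{X}_c'\mathbf{y}$ in (7.37), and $\text{SSE} = \sum_{i=1}^n (y_i - \bar{y})^2 - \hat{\boldsymbol{\beta}}_1'\mathbf{X}_c'\mathbf{y}$ in (7.39), we can write (8.1) as

$$\mathbf{y}'\Bigl(\mathbf{I} - \frac{1}{n}\mathbf{J}\Bigr)\mathbf{y} = \text{SSR} + \text{SSE}$$

$$= \mathbf{y}'\mathbf{X}_c(\mathbf{X}_c'\mathbf{X}_c)^{-1}\mathbf{X}_c'\mathbf{y} + \mathbf{y}'\Bigl(\mathbf{I} - \frac{1}{n}\mathbf{J}\Bigr)\mathbf{y} - \mathbf{y}'\mathbf{X}_c(\mathbf{X}_c'\mathbf{X}_c)^{-1}\mathbf{X}_c'\mathbf{y}$$

$$= \mathbf{y}'\mathbf{H}_c\mathbf{y} + \mathbf{y}'\Bigl(\mathbf{I} - \frac{1}{n}\mathbf{J} - \mathbf{H}_c\Bigr)\mathbf{y}, \tag{8.2}$$

where $\mathbf{H}_c = \mathbf{X}_c(\mathbf{X}_c'\mathbf{X}_c)^{-1}\mathbf{X}_c'$.
In the following theorem we establish some properties of the three matrices of the quadratic forms in (8.2).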


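Theorem 8.1a. The matrices $\mathbf{I} - (1/n)\mathbf{J}$, $\mathbf{H}_c = \mathbf{X}_c(\mathbf{X}_c'\mathbf{X}_c)^{-1}\mathbf{X}_c'$, and $\mathbf{I} - (1/n)\mathbf{J} - \mathbf{H}_c$ have the following properties:

(i) $\mathbf{H}_c[\mathbf{I} - (1/n)\mathbf{J}] = \mathbf{H}_c$. (8.3)
(ii) $\mathbf{H}_c$ is idempotent of rank $k$.
(iii) $\mathbf{I} - (1/n)\mathbf{J} - \mathbf{H}_c$ is idempotent of rank $n - k - 1$.
(iv) $\mathbf{H}_c[\mathbf{I} - (1/n)\mathbf{J} - \mathbf{H}_c] = \mathbf{O}$. (8.4)


PROOF. Part (i) follows from $\mathbf{X}_c'\mathbf{j} = \mathbf{0}$, which was established in Problem 7.16. Part (ii) can be shown by direct multiplication. Parts (iii) and (iv) follow from (i) and (ii).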


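The distributions of $\text{SSR}/\sigma^2$ and $\text{SSE}/\sigma^2$ are given in the following theorem.

Theorem 8.1b. If $\mathbf{y}$ is $N_n(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{I})$, then $\text{SSR}/\sigma^2 = \hat{\boldsymbol{\beta}}_1'\mathbf{X}_c'\mathbf{X}_c\hat{\boldsymbol{\beta}}_1/\sigma^2$ and $\text{SSE}/\sigma^2 = \bigl[\sum_{i=1}^n (y_i - \bar{y})^2 - \hat{\boldsymbol{\beta}}_1'\mathbf{X}_c'\mathbf{X}_c\hat{\boldsymbol{\beta}}_1\bigr]/\sigma^2$ have the following distributions:

(i) $\text{SSR}/\sigma^2$ is $\chi^2(k, \lambda_1)$, where $\lambda_1 = \boldsymbol{\mu}'\mathbf{A}\boldsymbol{\mu}/2\sigma^2 = \boldsymbol{\beta}_1'\mathbf{X}_c'\mathbf{X}_c\boldsymbol{\beta}_1/2\sigma^2$.
(ii) $\text{SSE}/\sigma^2$ is $\chi^2(n - k - 1)$.


PROOF. These results follow from (8.2), Theorem 8.1a(ii) and (iii), and Corollary 2 to Theorem 5.5.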


The independence of SSR and SSE is demonstrated in the following theorem.


Theorem 8.1c. If $\mathbf{y}$ is $N_n(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{I})$, then SSR and SSE are independent, where SSR and SSE are defined in (8.1) and (8.2).


PROOF. This follows from Theorem 8.1a(iv) and Corollary 1 to Theorem 5.6b.


We can now establish an $F$ test for $H_0\colon \boldsymbol{\beta}_1 = \mathbf{0}$ versus $H_1\colon \boldsymbol{\beta}_1 \neq \mathbf{0}$.


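Theorem 8.1d. If $\mathbf{y}$ is $N_n(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{I})$, the distribution of

$$F = \frac{\text{SSR}/(k\sigma^2)}{\text{SSE}/[(n - k - 1)\sigma^2]} = \frac{\text{SSR}/k}{\text{SSE}/(n - k - 1)} \tag{8.5}$$

is as follows:

(i) If $H_0\colon \boldsymbol{\beta}_1 = \mathbf{0}$ is false, then $F$ is distributed as $F(k, n - k - 1, \lambda_1)$, where $\lambda_1 = \boldsymbol{\beta}_1'\mathbf{X}_c'\mathbf{X}_c\boldsymbol{\beta}_1/2\sigma^2$.
(ii) If $H_0\colon \boldsymbol{\beta}_1 = \mathbf{0}$ is true, then $\lambda_1 = 0$ and $F$ is distributed as $F(k, n - k - 1)$.


PROOF

(i) This result follows from (5.30) and Theorems 8.1b and 8.1c.
(ii) This result follows from (5.28) and Theorems 8.1b and 8.1c.


Note that $\lambda_1 = 0$ if and only if $\boldsymbol{\beta}_1 = \mathbf{0}$, since $\mathbf{X}_c'\mathbf{X}_c$ is positive definite (see Corollary 1 to Theorem 2.6b).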


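TABLE 8.1 ANOVA Table for the F Test of $H_0\colon \boldsymbol{\beta}_1 = \mathbf{0}$

Source of Variation    df           Sum of Squares                                                                                                                                          Mean Square          Expected Mean Square
Due to $\boldsymbol{\beta}_1$    $k$          $\text{SSR} = \hat{\boldsymbol{\beta}}_1'\mathbf{X}_c'\mathbf{y} = \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} - n\bar{y}^2$                          $\text{SSR}/k$       $\sigma^2 + \frac{1}{k}\boldsymbol{\beta}_1'\mathbf{X}_c'\mathbf{X}_c\boldsymbol{\beta}_1$
Error                  $n - k - 1$  $\text{SSE} = \sum_i (y_i - \bar{y})^2 - \hat{\boldsymbol{\beta}}_1'\mathbf{X}_c'\mathbf{y} = \mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y}$    $\text{SSE}/(n - k - 1)$    $\sigma^2$
Total                  $n - 1$      $\text{SST} = \sum_i (y_i - \bar{y})^2$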


The test for $H_0\colon \boldsymbol{\beta}_1 = \mathbf{0}$ is carried out as follows. Reject $H_0$ if $F > F_{\alpha,k,n-k-1}$, where $F_{\alpha,k,n-k-1}$ is the upper $\alpha$ percentage point of the (central) $F$ distribution. Alternatively, a $p$ value can be used to carry out the test. A $p$ value is the tail area of the central $F$ distribution beyond the calculated $F$ value, that is, the probability of exceeding the calculated $F$ value, assuming $H_0\colon \boldsymbol{\beta}_1 = \mathbf{0}$ to be true. A $p$ value less than $\alpha$ is equivalent to $F > F_{\alpha,k,n-k-1}$.

The analysis-of-variance (ANOVA) table (Table 8.1) summarizes the results and calculations leading to the overall $F$ test. Mean squares are sums of squares divided by the degrees of freedom of the associated chi-square ($\chi^2$) distributions.

The entries in the column for expected mean squares in Table 8.1 are simply $E(\text{SSR}/k)$ and $E[\text{SSE}/(n - k - 1)]$. The first of these can be established by Theorem 5.2a or by (5.20). The second was established by Theorem 7.3f.

If $H_0\colon \boldsymbol{\beta}_1 = \mathbf{0}$ is true, both of the expected mean squares in Table 8.1 are equal to $\sigma^2$, and we expect $F$ to be near 1. If $\boldsymbol{\beta}_1 \neq \mathbf{0}$, then $E(\text{SSR}/k) > \sigma^2$ since $\mathbf{X}_c'\mathbf{X}_c$ is positive definite, and we expect $F$ to exceed 1. We therefore reject $H_0$ for large values of $F$.

The test of $H_0\colon \boldsymbol{\beta}_1 = \mathbf{0}$ in Table 8.1 has been developed using the centered model (7.32). We can also express SSR and SSE in terms of the noncentered model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ in (7.4):

$$\text{SSR} = \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} - n\bar{y}^2, \qquad \text{SSE} = \mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y}. \tag{8.6}$$

These are the same as SSR and SSE in (8.1) [see (7.24), (7.39), (7.54), and Problems 7.19, 7.25].


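Example 8.1. Using the data in Table 7.1, we illustrate the test of $H_0\colon \boldsymbol{\beta}_1 = \mathbf{0}$ where, in this case, $\boldsymbol{\beta}_1 = (\beta_1, \beta_2)'$. In Example 7.3.1(a), we found $\mathbf{X}'\mathbf{y} = (90, 482, 872)'$ and $\hat{\boldsymbol{\beta}} = (5.3754, 3.0118, -1.2855)'$. The quantities $\mathbf{y}'\mathbf{y}$, $\hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y}$, and $n\bar{y}^2$ are given by

$$\mathbf{y}'\mathbf{y} = \sum_{i=1}^{12} y_i^2 = 2^2 + 3^2 + \cdots + 14^2 = 840,$$

$$\hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} = (5.3754, 3.0118, -1.2855)\begin{pmatrix} 90 \\ 482 \\ 872 \end{pmatrix} = 814.5410,$$

$$n\bar{y}^2 = n\Bigl(\frac{\sum_i y_i}{n}\Bigr)^2 = 12\Bigl(\frac{90}{12}\Bigr)^2 = 675.$$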

TABLE 8.2 ANOVA for Overall Regression Test for Data in Table 7.1

Source                            df        SS          MS         F
Due to $\boldsymbol{\beta}_1$      2     139.5410    69.7705    24.665
Error                              9      25.4590     2.8288
Total                             11     165.0000


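Thus, by (8.6), we obtain

$$\text{SSR} = \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} - n\bar{y}^2 = 139.5410,$$

$$\text{SSE} = \mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} = 25.4590,$$

$$\sum_{i=1}^n (y_i - \bar{y})^2 = \mathbf{y}'\mathbf{y} - n\bar{y}^2 = 165.$$

The $F$ test is given in Table 8.2. Since $24.665 > F_{.05,2,9} = 4.26$, we reject $H_0\colon \boldsymbol{\beta}_1 = \mathbf{0}$ and conclude that at least one of $\beta_1$ or $\beta_2$ is not zero. The $p$ value is .000223.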
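The calculations in Example 8.1 can be reproduced with a few lines of code. This is a sketch assuming NumPy and the $y$, $x_1$, $x_2$ values listed in Table 7.2; the variable names are hypothetical. To avoid depending on a statistics library, the $p$ value uses the closed-form tail area of the central $F$ distribution that applies only when the numerator degrees of freedom equal 2, namely $(1 + 2F/\nu_2)^{-\nu_2/2}$.

```python
import numpy as np

# Data from Table 7.1 (as listed in Table 7.2).
y = np.array([2, 3, 2, 7, 6, 8, 10, 7, 8, 12, 11, 14], dtype=float)
x1 = np.array([0, 2, 2, 2, 4, 4, 4, 6, 6, 6, 8, 8], dtype=float)
x2 = np.array([2, 6, 7, 5, 9, 8, 7, 10, 11, 9, 15, 13], dtype=float)

n, k = len(y), 2
X = np.column_stack([np.ones(n), x1, x2])
b = np.linalg.solve(X.T @ X, X.T @ y)           # least-squares estimate

# Sums of squares in noncentered form, as in (8.6).
ssr = b @ X.T @ y - n * y.mean() ** 2
sse = y @ y - b @ X.T @ y
sst = y @ y - n * y.mean() ** 2

# F statistic from (8.5).
df_error = n - k - 1
F = (ssr / k) / (sse / df_error)

# Tail area of the central F(2, df_error) distribution; this closed form
# holds only because the numerator degrees of freedom equal 2 here.
p_value = (1 + 2 * F / df_error) ** (-df_error / 2)
```

For other numerator degrees of freedom, a general $F$ distribution routine would be needed in place of the closed form.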


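8.2 TEST ON A SUBSET OF THE $\beta$'S


In more generality, suppose that we wish to test the hypothesis that a subset of the $x$'s is not useful in predicting $y$. A simple example is $H_0\colon \beta_j = 0$ for a single $\beta_j$. If $H_0$ is rejected, we would retain $\beta_jx_j$ in the model. As another illustration, consider the model in (7.2)

$$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_1^2 + \beta_4x_2^2 + \beta_5x_1x_2 + \varepsilon,$$

for which we may wish to test the hypothesis $H_0\colon \beta_3 = \beta_4 = \beta_5 = 0$. If $H_0$ is rejected, we would choose the full second-order model over the reduced first-order model.

Without loss of generality, we assume that the $\beta$'s to be tested have been arranged last in $\boldsymbol{\beta}$, with a corresponding arrangement of the columns of $\mathbf{X}$. Then $\boldsymbol{\beta}$ and $\mathbf{X}$ can be partitioned accordingly, and by (7.78), the model for all $n$ observations becomes

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} = (\mathbf{X}_1, \mathbf{X}_2)\begin{pmatrix} \boldsymbol{\beta}_1 \\ \boldsymbol{\beta}_2 \end{pmatrix} + \boldsymbol{\varepsilon} = \mathbf{X}_1\boldsymbol{\beta}_1 + \mathbf{X}_2\boldsymbol{\beta}_2 + \boldsymbol{\varepsilon}, \tag{8.7}$$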


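where $\boldsymbol{\beta}_2$ contains the $\beta$'s to be tested. The intercept $\beta_0$ would ordinarily be included in $\boldsymbol{\beta}_1$.

The hypothesis of interest is $H_0\colon \boldsymbol{\beta}_2 = \mathbf{0}$. If we designate the number of parameters in $\boldsymbol{\beta}_2$ by $h$, then $\mathbf{X}_2$ is $n \times h$, $\boldsymbol{\beta}_1$ is $(k - h + 1) \times 1$, and $\mathbf{X}_1$ is $n \times (k - h + 1)$. Thus $\boldsymbol{\beta}_1 = (\beta_0, \beta_1, \ldots, \beta_{k-h})'$ and $\boldsymbol{\beta}_2 = (\beta_{k-h+1}, \ldots, \beta_k)'$. In terms of the illustration at the beginning of this section, we would have $\boldsymbol{\beta}_1 = (\beta_0, \beta_1, \beta_2)'$ and $\boldsymbol{\beta}_2 = (\beta_3, \beta_4, \beta_5)'$. Note that $\boldsymbol{\beta}_1$ in (8.7) is different from $\boldsymbol{\beta}_1$ in Section 8.1, in which $\boldsymbol{\beta}$ was partitioned as $\boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \boldsymbol{\beta}_1 \end{pmatrix}$ and $\boldsymbol{\beta}_1$ constituted all of $\boldsymbol{\beta}$ except $\beta_0$.
To test $H_0\colon \boldsymbol{\beta}_2 = \mathbf{0}$ versus $H_1\colon \boldsymbol{\beta}_2 \neq \mathbf{0}$, we use a full-reduced-model approach. The full model is given by (8.7). Under $H_0\colon \boldsymbol{\beta}_2 = \mathbf{0}$, the reduced model becomes

$$\mathbf{y} = \mathbf{X}_1\boldsymbol{\beta}_1^* + \boldsymbol{\varepsilon}^*. \tag{8.8}$$

We use the notation $\boldsymbol{\beta}_1^*$ and $\boldsymbol{\varepsilon}^*$ as in Section 7.9, because in the reduced model, $\boldsymbol{\beta}_1^*$ and $\boldsymbol{\varepsilon}^*$ will typically be different from $\boldsymbol{\beta}_1$ and $\boldsymbol{\varepsilon}$ in the full model (unless $\mathbf{X}_1$ and $\mathbf{X}_2$ are orthogonal; see Theorem 7.9a and its corollary). The estimator of $\boldsymbol{\beta}_1^*$ in the reduced model (8.8) is $\hat{\boldsymbol{\beta}}_1^* = (\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{y}$, which is, in general, not the same as the first $k - h + 1$ elements of $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$ from the full model (8.7) (unless $\mathbf{X}_1$ and $\mathbf{X}_2$ are orthogonal; see Theorem 7.10).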

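In order to compare the fit of the full model (8.7) to the fit of the reduced model (8.8), we add and subtract $\hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y}$ and $\hat{\boldsymbol{\beta}}_1^{*\prime}\mathbf{X}_1'\mathbf{y}$ to the total corrected sum of squares $\sum_{i=1}^n (y_i - \bar{y})^2 = \mathbf{y}'\mathbf{y} - n\bar{y}^2$ so as to obtain the partitioning

$$\mathbf{y}'\mathbf{y} - n\bar{y}^2 = (\mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y}) + (\hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} - \hat{\boldsymbol{\beta}}_1^{*\prime}\mathbf{X}_1'\mathbf{y}) + (\hat{\boldsymbol{\beta}}_1^{*\prime}\mathbf{X}_1'\mathbf{y} - n\bar{y}^2) \tag{8.9}$$

or

$$\text{SST} = \text{SSE} + \text{SS}(\boldsymbol{\beta}_2|\boldsymbol{\beta}_1) + \text{SSR(reduced)}, \tag{8.10}$$

where $\text{SS}(\boldsymbol{\beta}_2|\boldsymbol{\beta}_1) = \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} - \hat{\boldsymbol{\beta}}_1^{*\prime}\mathbf{X}_1'\mathbf{y}$ is the "extra" regression sum of squares due to $\boldsymbol{\beta}_2$ after adjusting for $\boldsymbol{\beta}_1$. Note that $\text{SS}(\boldsymbol{\beta}_2|\boldsymbol{\beta}_1)$ can also be expressed as

$$\text{SS}(\boldsymbol{\beta}_2|\boldsymbol{\beta}_1) = \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} - n\bar{y}^2 - (\hat{\boldsymbol{\beta}}_1^{*\prime}\mathbf{X}_1'\mathbf{y} - n\bar{y}^2) = \text{SSR(full)} - \text{SSR(reduced)},$$

which is the difference between the overall regression sum of squares for the full model and the overall regression sum of squares for the reduced model [see (8.6)].

If $H_0\colon \boldsymbol{\beta}_2 = \mathbf{0}$ is true, we would expect $\text{SS}(\boldsymbol{\beta}_2|\boldsymbol{\beta}_1)$ to be small so that SST in (8.10) is composed mostly of SSR(reduced) and SSE. If $\boldsymbol{\beta}_2 \neq \mathbf{0}$, we expect $\text{SS}(\boldsymbol{\beta}_2|\boldsymbol{\beta}_1)$ to be larger and account for more of SST. Thus we are testing $H_0\colon \boldsymbol{\beta}_2 = \mathbf{0}$ in the full model in which there are no restrictions on $\boldsymbol{\beta}_1$. We are not ignoring $\boldsymbol{\beta}_1$ (assuming $\boldsymbol{\beta}_1 = \mathbf{0}$) but are testing $H_0\colon \boldsymbol{\beta}_2 = \mathbf{0}$ in the presence of $\boldsymbol{\beta}_1$, that is, above and beyond whatever $\boldsymbol{\beta}_1$ contributes to SST.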
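The partitioning in (8.9) and (8.10) can be checked numerically. The following is a sketch assuming NumPy; the simulated data, the first-order versus second-order split, and the helper `reg_ss` are hypothetical illustrations of the full and reduced models.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 15
x1, x2 = rng.normal(size=n), rng.normal(size=n)

# Hypothetical reduced model (first order) and tested second-order terms.
X1 = np.column_stack([np.ones(n), x1, x2])
X2 = np.column_stack([x1**2, x2**2, x1 * x2])
X = np.hstack([X1, X2])
y = 1 + x1 - x2 + rng.normal(size=n)


def reg_ss(Xm, y):
    """Return beta_hat' X' y for the least-squares fit of y on Xm."""
    b = np.linalg.lstsq(Xm, y, rcond=None)[0]
    return b @ Xm.T @ y


correction = n * y.mean() ** 2
sst = y @ y - correction                         # total corrected sum of squares
sse = y @ y - reg_ss(X, y)                       # error SS from the full model
ss_extra = reg_ss(X, y) - reg_ss(X1, y)          # SS(beta_2 | beta_1)
ssr_reduced = reg_ss(X1, y) - correction         # SSR(reduced)
```

By construction, `ss_extra` also equals SSR(full) minus SSR(reduced), since the correction term $n\bar{y}^2$ cancels.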


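To develop a test statistic based on $\text{SS}(\boldsymbol{\beta}_2|\boldsymbol{\beta}_1)$, we first write (8.9) in terms of quadratic forms in $\mathbf{y}$. Using $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$ and $\hat{\boldsymbol{\beta}}_1^* = (\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{y}$ and (5.2), (8.9) becomes

$$\mathbf{y}'\Bigl(\mathbf{I} - \frac{1}{n}\mathbf{J}\Bigr)\mathbf{y} = \mathbf{y}'\mathbf{y} - \mathbf{y}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} + \mathbf{y}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} - \mathbf{y}'\mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{y} + \mathbf{y}'\mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{y} - \mathbf{y}'\frac{1}{n}\mathbf{J}\mathbf{y}$$

$$= \mathbf{y}'[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']\mathbf{y} + \mathbf{y}'[\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' - \mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1']\mathbf{y} + \mathbf{y}'\Bigl[\mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1' - \frac{1}{n}\mathbf{J}\Bigr]\mathbf{y} \tag{8.11}$$

$$= \mathbf{y}'(\mathbf{I} - \mathbf{H})\mathbf{y} + \mathbf{y}'(\mathbf{H} - \mathbf{H}_1)\mathbf{y} + \mathbf{y}'\Bigl(\mathbf{H}_1 - \frac{1}{n}\mathbf{J}\Bigr)\mathbf{y}, \tag{8.12}$$

where $\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$ and $\mathbf{H}_1 = \mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'$. The matrix $\mathbf{I} - \mathbf{H}$ was shown to be idempotent in Problem 5.32a, with rank $n - k - 1$, where $k + 1$ is the rank of $\mathbf{X}$ ($k + 1$ is also the number of elements in $\boldsymbol{\beta}$). The matrix $\mathbf{H} - \mathbf{H}_1$ is shown to be idempotent in the following theorem.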


Theorem 8.2a. The matrix $H - H_1 = X(X'X)^{-1}X' - X_1(X_1'X_1)^{-1}X_1'$ is idempotent with rank $h$, where $h$ is the number of elements in $\beta_2$.

PROOF. Premultiplying $X$ by $H$, we obtain

$$HX = X(X'X)^{-1}X'X = X$$

or

$$X = [X(X'X)^{-1}X']X. \tag{8.13}$$

Partitioning $X$ on the left side of (8.13) and the last $X$ on the right side, we obtain [by an extension of (2.28)]

$$(X_1, X_2) = [X(X'X)^{-1}X'](X_1, X_2) = [X(X'X)^{-1}X'X_1,\; X(X'X)^{-1}X'X_2].$$

Thus

$$X_1 = X(X'X)^{-1}X'X_1, \qquad X_2 = X(X'X)^{-1}X'X_2. \tag{8.14}$$


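Simplifying $HH_1$ and $H_1H$ by (8.14) and its transpose, we obtain

$$HH_1 = H_1 \quad\text{and}\quad H_1H = H_1. \tag{8.15}$$

The matrices $H$ and $H_1$ are idempotent (see Problem 5.32). Thus

$$\begin{aligned}
(H - H_1)^2 &= H^2 - HH_1 - H_1H + H_1^2 \\
&= H - H_1 - H_1 + H_1 \\
&= H - H_1,
\end{aligned}$$

and $H - H_1$ is idempotent. For the rank of $H - H_1$, we have (by Theorem 2.13d)

$$\begin{aligned}
\mathrm{rank}(H - H_1) &= \mathrm{tr}(H - H_1) \\
&= \mathrm{tr}(H) - \mathrm{tr}(H_1) \qquad [\text{by (2.86)}] \\
&= \mathrm{tr}[X(X'X)^{-1}X'] - \mathrm{tr}[X_1(X_1'X_1)^{-1}X_1'] \\
&= \mathrm{tr}[X'X(X'X)^{-1}] - \mathrm{tr}[X_1'X_1(X_1'X_1)^{-1}] \qquad [\text{by (2.87)}] \\
&= \mathrm{tr}(I_{k+1}) - \mathrm{tr}(I_{k-h+1}) = k + 1 - (k - h + 1) = h.
\end{aligned}$$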
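As a numerical illustration of Theorem 8.2a, the sketch below builds $H$ and $H_1$ for a simulated design matrix and checks (8.15), idempotency, and the rank. The dimensions and the helper name `hat` are assumptions introduced here, not part of the text.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, h = 25, 5, 3                      # hypothetical sizes
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
X1 = X[:, : k - h + 1]                  # first k - h + 1 columns of X

def hat(A):
    """Return A (A'A)^{-1} A' for a matrix A of full column rank."""
    return A @ np.linalg.solve(A.T @ A, A.T)

H, H1 = hat(X), hat(X1)
D = H - H1

print(np.allclose(H @ H1, H1), np.allclose(H1 @ H, H1))    # the two identities in (8.15)
print(np.allclose(D @ D, D))                                # H - H1 is idempotent
print(round(np.trace(D)), np.linalg.matrix_rank(D, tol=1e-8))  # trace and rank, both h
```

An explicit tolerance is passed to `matrix_rank` because the zero eigenvalues of $H - H_1$ are only zero up to rounding error.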


We now find the distributions of $y'(I - H)y$ and $y'(H - H_1)y$ in (8.12) and show that they are independent.

Theorem 8.2b. If $y$ is $N_n(X\beta, \sigma^2 I)$ and $H$ and $H_1$ are as defined in (8.11) and (8.12), then

(i) $y'(I - H)y/\sigma^2$ is $\chi^2(n-k-1)$.
(ii) $y'(H - H_1)y/\sigma^2$ is $\chi^2(h, \lambda_1)$, where $\lambda_1 = \beta_2'[X_2'X_2 - X_2'X_1(X_1'X_1)^{-1}X_1'X_2]\beta_2/2\sigma^2$.
(iii) $y'(I - H)y$ and $y'(H - H_1)y$ are independent.

PROOF. Adding $y'(1/n)Jy$ to both sides of (8.12), we obtain the decomposition $y'y = y'(I - H)y + y'(H - H_1)y + y'H_1y$. The matrices $I - H$, $H - H_1$, and $H_1$ were shown to be idempotent in Problem 5.32 and Theorem 8.2a. Hence by Corollary 1 to Theorem 5.6c, all parts of the theorem follow. See Problem 8.9 for the derivation of $\lambda_1$.

If $\lambda_1 = 0$ in Theorem 8.2b(ii), then $y'(H - H_1)y/\sigma^2$ has the central chi-square distribution $\chi^2(h)$. Since $X_2'X_2 - X_2'X_1(X_1'X_1)^{-1}X_1'X_2$ is positive definite (see Problem 8.10), $\lambda_1 = 0$ if and only if $\beta_2 = 0$.

An F test for $H_0\colon \beta_2 = 0$ versus $H_1\colon \beta_2 \ne 0$ is given in the following theorem.




Theorem 8.2c. Let $y$ be $N_n(X\beta, \sigma^2 I)$ and define an F statistic as follows:

$$F = \frac{y'(H - H_1)y/h}{y'(I - H)y/(n-k-1)} = \frac{\mathrm{SS}(\beta_2|\beta_1)/h}{\mathrm{SSE}/(n-k-1)} \tag{8.16}$$

$$\phantom{F} = \frac{(\hat\beta'X'y - \hat\beta_1^{*\prime}X_1'y)/h}{(y'y - \hat\beta'X'y)/(n-k-1)}, \tag{8.17}$$

where $\hat\beta = (X'X)^{-1}X'y$ is from the full model $y = X\beta + \varepsilon$ and $\hat\beta_1^* = (X_1'X_1)^{-1}X_1'y$ is from the reduced model $y = X_1\beta_1^* + \varepsilon^*$. The distribution of $F$ in (8.17) is as follows:

(i) If $H_0\colon \beta_2 = 0$ is false, then $F$ is distributed as $F(h, n-k-1, \lambda_1)$, where $\lambda_1 = \beta_2'[X_2'X_2 - X_2'X_1(X_1'X_1)^{-1}X_1'X_2]\beta_2/2\sigma^2$.
(ii) If $H_0\colon \beta_2 = 0$ is true, then $\lambda_1 = 0$ and $F$ is distributed as $F(h, n-k-1)$.

PROOF

(i) This result follows from (5.30) and Theorem 8.2b.
(ii) This result follows from (5.28) and Theorem 8.2b.

The test for $H_0\colon \beta_2 = 0$ is carried out as follows: Reject $H_0$ if $F \ge F_{\alpha,h,n-k-1}$, where $F_{\alpha,h,n-k-1}$ is the upper $\alpha$ percentage point of the (central) F distribution. Alternatively, we reject $H_0$ if $p \le \alpha$, where $p$ is the p value. Since $X_2'X_2 - X_2'X_1(X_1'X_1)^{-1}X_1'X_2$ is positive definite (see Problem 8.10), $\lambda_1 > 0$ if $H_0\colon \beta_2 = 0$ is false. This justifies rejection of $H_0$ for large values of $F$.

Results and calculations leading to this F test are summarized in the ANOVA table (Table 8.3), where $\beta_1$ is $(k-h+1) \times 1$, $\beta_2$ is $h \times 1$, $X_1$ is $n \times (k-h+1)$, and $X_2$ is $n \times h$.

The entries in the column for expected mean squares are $E[\mathrm{SS}(\beta_2|\beta_1)/h]$ and $E[\mathrm{SSE}/(n-k-1)]$. For $E[\mathrm{SS}(\beta_2|\beta_1)/h]$, see Problem 8.11. Note that if $H_0$ is true, both expected mean squares (Table 8.3) are equal to $\sigma^2$, and if $H_0$ is false, $E[\mathrm{SS}(\beta_2|\beta_1)/h] > E[\mathrm{SSE}/(n-k-1)]$ since $X_2'X_2 - X_2'X_1(X_1'X_1)^{-1}X_1'X_2$ is positive definite. This inequality provides another justification for rejecting $H_0$ for large values of $F$.


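TABLE 8.3 ANOVA Table for F-Test of $H_0\colon \beta_2 = 0$

| Source of Variation | df | Sum of Squares | Mean Square | Expected Mean Square |
|---|---|---|---|---|
| Due to $\beta_2$ adjusted for $\beta_1$ | $h$ | $\mathrm{SS}(\beta_2 \mid \beta_1) = \hat\beta'X'y - \hat\beta_1^{*\prime}X_1'y$ | $\mathrm{SS}(\beta_2 \mid \beta_1)/h$ | $\sigma^2 + \frac{1}{h}\beta_2'[X_2'X_2 - X_2'X_1(X_1'X_1)^{-1}X_1'X_2]\beta_2$ |
| Error | $n-k-1$ | $\mathrm{SSE} = y'y - \hat\beta'X'y$ | $\mathrm{SSE}/(n-k-1)$ | $\sigma^2$ |
| Total | $n-1$ | $\mathrm{SST} = y'y - n\bar y^2$ | | |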




Example 8.2a. Consider the dependent variable $y_2$ in the chemical reaction data in Table 7.4 (see Problem 7.52 for a description of the variables). In order to check the usefulness of second-order terms in predicting $y_2$, we use as a full model

$$y_2 = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_3 + \beta_4x_1^2 + \beta_5x_2^2 + \beta_6x_3^2 + \beta_7x_1x_2 + \beta_8x_1x_3 + \beta_9x_2x_3 + \varepsilon,$$

and test $H_0\colon \beta_4 = \beta_5 = \cdots = \beta_9 = 0$. For the full model, we obtain $\hat\beta'X'y - n\bar y^2 = 339.7888$, and for the reduced model $y_2 = \beta_0^* + \beta_1^*x_1 + \beta_2^*x_2 + \beta_3^*x_3 + \varepsilon^*$, we have $\hat\beta_1^{*\prime}X_1'y - n\bar y^2 = 151.0022$. The difference is $\hat\beta'X'y - \hat\beta_1^{*\prime}X_1'y = 188.7866$. The error sum of squares is $\mathrm{SSE} = 60.6755$, and the F statistic is given by (8.16) or Table 8.3 as

$$F = \frac{188.7866/6}{60.6755/9} = \frac{31.4644}{6.7417} = 4.6671,$$

which has a p value of .0198. Thus the second-order terms are useful in prediction of $y_2$. In fact, the overall F in (8.5) for the reduced model is 3.027 with $p = .0623$, so that $x_1$, $x_2$, and $x_3$ are inadequate for predicting $y_2$. The overall F for the full model is 5.600 with $p = .0086$.
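The arithmetic of Example 8.2a can be reproduced from the quoted sums of squares alone. The sketch below assumes that the error degrees of freedom 9 correspond to $n - k - 1$ with $k = 9$ (so that $n = 19$), and it recovers the reduced-model error sum of squares from the partition (8.10); p values are not computed.

```python
# Sums of squares quoted in Example 8.2a
ssr_full = 339.7888      # regression SS for the full model, corrected for the mean
ssr_reduced = 151.0022   # regression SS for the reduced model
sse = 60.6755            # error SS for the full model
h, k = 6, 9              # number of second-order terms tested; predictors in the full model
df_error = 9             # n - k - 1 for the full model
n = df_error + k + 1     # implied sample size (an inference, not stated in the example)

ss_extra = ssr_full - ssr_reduced                  # SS(beta_2 | beta_1)
F_subset = (ss_extra / h) / (sse / df_error)       # (8.16)

# Overall regression F statistics, as in (8.5)
F_full = (ssr_full / k) / (sse / df_error)
sse_reduced = sse + ss_extra                       # SST - SSR(reduced), from (8.10)
F_reduced = (ssr_reduced / 3) / (sse_reduced / (n - 3 - 1))

print(round(ss_extra, 4), round(F_subset, 4), round(F_full, 3), round(F_reduced, 3))
```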


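In the following theorem, we express $\mathrm{SS}(\beta_2|\beta_1)$ as a quadratic form in $\hat\beta_2$ that corresponds to $\lambda_1$ in Theorem 8.2b(ii).

Theorem 8.2d. If the model is partitioned as in (8.7), then $\mathrm{SS}(\beta_2|\beta_1) = \hat\beta'X'y - \hat\beta_1^{*\prime}X_1'y$ can be written as

$$\mathrm{SS}(\beta_2|\beta_1) = \hat\beta_2'[X_2'X_2 - X_2'X_1(X_1'X_1)^{-1}X_1'X_2]\hat\beta_2, \tag{8.18}$$

where $\hat\beta_2$ is from a partitioning of $\hat\beta$ in the full model:

$$\hat\beta = \begin{pmatrix} \hat\beta_1 \\ \hat\beta_2 \end{pmatrix} = (X'X)^{-1}X'y. \tag{8.19}$$

PROOF. We can write $X\hat\beta$ in terms of $\hat\beta_1$ and $\hat\beta_2$ as

$$X\hat\beta = (X_1, X_2)\begin{pmatrix} \hat\beta_1 \\ \hat\beta_2 \end{pmatrix} = X_1\hat\beta_1 + X_2\hat\beta_2.$$

To write $\hat\beta_1^*$ in terms of $\hat\beta_1$ and $\hat\beta_2$, we note that by (7.80), $E(\hat\beta_1^*) = \beta_1 + A\beta_2$, where $A = (X_1'X_1)^{-1}X_1'X_2$ is the alias matrix defined in Theorem 7.9a. This can be estimated by $\hat\beta_1^* = \hat\beta_1 + A\hat\beta_2$, where $\hat\beta_1$ and $\hat\beta_2$ are from the full model, as in (8.19). Then $\mathrm{SS}(\beta_2|\beta_1)$ in (8.10) or Table 8.3 can be written as

$$\begin{aligned}
\mathrm{SS}(\beta_2|\beta_1) &= \hat\beta'X'y - \hat\beta_1^{*\prime}X_1'y \\
&= \hat\beta'X'X\hat\beta - \hat\beta_1^{*\prime}X_1'X_1\hat\beta_1^* \qquad [\text{by (7.8)}] \\
&= (\hat\beta_1'X_1' + \hat\beta_2'X_2')(X_1\hat\beta_1 + X_2\hat\beta_2) - (\hat\beta_1' + \hat\beta_2'A')X_1'X_1(\hat\beta_1 + A\hat\beta_2).
\end{aligned}$$

Multiplying this out and substituting $(X_1'X_1)^{-1}X_1'X_2$ for $A$, we obtain (8.18).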




In (8.18), it is clear that $\mathrm{SS}(\beta_2|\beta_1)$ is due to $\hat\beta_2$. We also see in (8.18) a direct correspondence between $\mathrm{SS}(\beta_2|\beta_1)$ and the noncentrality parameter $\lambda_1$ in Theorem 8.2b(ii) or the expected mean square in Table 8.3.

Example 8.2b. The full-reduced-model test of $H_0\colon \beta_2 = 0$ in Table 8.3 can be used to test for significance of a single $\hat\beta_j$. To illustrate, suppose that we wish to test $H_0\colon \beta_k = 0$, where $\beta$ is partitioned as

$$\beta = \begin{pmatrix} \beta_1 \\ \beta_k \end{pmatrix}.$$

Then $X$ is partitioned as $X = (X_1, x_k)$, where $x_k$ is the last column of $X$ and $X_1$ contains all columns except $x_k$. The reduced model is $y = X_1\beta_1^* + \varepsilon^*$, and $\beta_1^*$ is estimated as $\hat\beta_1^* = (X_1'X_1)^{-1}X_1'y$. In this case, $h = 1$, and the F statistic in (8.17) becomes

$$F = \frac{\hat\beta'X'y - \hat\beta_1^{*\prime}X_1'y}{(y'y - \hat\beta'X'y)/(n-k-1)}, \tag{8.20}$$

which is distributed as $F(1, n-k-1)$ if $H_0\colon \beta_k = 0$ is true.


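Example 8.2c. The test in Section 8.1 for overall regression can be obtained as a full-reduced-model test. In this case, the partitioning of $X$ and of $\beta$ is $X = (j, X_1)$ and

$$\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix} = \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix},$$

where in the partitioned form $\beta_1 = (\beta_1, \ldots, \beta_k)'$. The reduced model is $y = \beta_0^* j + \varepsilon^*$, for which we have

$$\hat\beta_0^* = \bar y \quad\text{and}\quad \mathrm{SS}(\beta_0^*) = n\bar y^2 \tag{8.21}$$

(see Problem 8.13). Then $\mathrm{SS}(\beta_1|\beta_0) = \hat\beta'X'y - n\bar y^2$, which is the same as (8.6).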


8.3 F TEST IN TERMS OF $R^2$


The F statistics in Sections 8.1 and 8.2 can be expressed in terms of $R^2$ as defined in (7.56).




Theorem 8.3. The F statistics in (8.5) and (8.17) for testing $H_0\colon \beta_1 = 0$ and $H_0\colon \beta_2 = 0$, respectively, can be written in terms of $R^2$ as

$$F = \frac{(\hat\beta'X'y - n\bar y^2)/k}{(y'y - \hat\beta'X'y)/(n-k-1)} \tag{8.22}$$

$$\phantom{F} = \frac{R^2/k}{(1 - R^2)/(n-k-1)} \tag{8.23}$$

and

$$F = \frac{(\hat\beta'X'y - \hat\beta_1^{*\prime}X_1'y)/h}{(y'y - \hat\beta'X'y)/(n-k-1)} \tag{8.24}$$

$$\phantom{F} = \frac{(R^2 - R_r^2)/h}{(1 - R^2)/(n-k-1)}, \tag{8.25}$$

where $R^2$ for the full model is given in (7.56) as $R^2 = (\hat\beta'X'y - n\bar y^2)/(y'y - n\bar y^2)$ and $R_r^2$ for the reduced model $y = X_1\beta_1^* + \varepsilon^*$ in (8.8) is similarly defined as

$$R_r^2 = \frac{\hat\beta_1^{*\prime}X_1'y - n\bar y^2}{y'y - n\bar y^2}. \tag{8.26}$$

PROOF. Adding and subtracting $n\bar y^2$ in the denominator of (8.22) gives

$$F = \frac{(\hat\beta'X'y - n\bar y^2)/k}{[y'y - n\bar y^2 - (\hat\beta'X'y - n\bar y^2)]/(n-k-1)}.$$

Dividing numerator and denominator by $y'y - n\bar y^2$ yields (8.23). For (8.25), see Problem 8.15.

In (8.25), we see that the F test for $H_0\colon \beta_2 = 0$ is equivalent to a test for significant reduction in $R^2$. Note also that since $F \ge 0$ in (8.25), we have $R^2 \ge R_r^2$, which is an additional confirmation of property 3 in Section 7.7, namely, that adding an $x$ to the model increases $R^2$.


Example 8.3. For the dependent variable $y_2$ in the chemical reaction data in Table 7.4, a full model with nine x's and a reduced model with three x's were considered in Example 8.2a. The values of $R^2$ for the full model and reduced model are .8485 and .3771, respectively. To test the significance of the increase in $R^2$ from .3771 to .8485, we use (8.25):

$$F = \frac{(R^2 - R_r^2)/h}{(1 - R^2)/(n-k-1)} = \frac{(.8485 - .3771)/6}{(1 - .8485)/9} = \frac{.07857}{.01683} = 4.6671,$$

which is the same as the value obtained for $F$ in Example 8.2a.
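The equivalence of (8.16) and (8.25) can be confirmed directly from the sums of squares quoted in Example 8.2a. In this sketch the two $R^2$ values are recomputed rather than taken as rounded inputs; the variable names are ours.

```python
ssr_full, ssr_reduced, sse = 339.7888, 151.0022, 60.6755   # from Example 8.2a
h, df_error = 6, 9                                          # h and n - k - 1

sst = ssr_full + sse                 # y'y - n * ybar^2
R2_full = ssr_full / sst             # R^2 for the full model, as in (7.56)
R2_reduced = ssr_reduced / sst       # R^2 for the reduced model, as in (8.26)

F_from_R2 = ((R2_full - R2_reduced) / h) / ((1 - R2_full) / df_error)   # (8.25)
F_from_ss = ((ssr_full - ssr_reduced) / h) / (sse / df_error)           # (8.16)

print(round(R2_full, 4), round(R2_reduced, 4), F_from_R2, F_from_ss)
```

Because both forms divide the same quantities by the common factor $y'y - n\bar y^2$, the two F values agree to rounding error; small discrepancies arise only when the rounded $R^2$ values are used instead.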


8.4 THE GENERAL LINEAR HYPOTHESIS TESTS FOR $H_0\colon C\beta = 0$ AND $H_0\colon C\beta = t$


We discuss a test for $H_0\colon C\beta = 0$ in Section 8.4.1 and a test for $H_0\colon C\beta = t$ in Section 8.4.2.


8.4.1 The Test for $H_0\colon C\beta = 0$


The hypothesis $H_0\colon C\beta = 0$, where $C$ is a known $q \times (k+1)$ coefficient matrix of rank $q \le k+1$, is known as the general linear hypothesis. The alternative hypothesis is $H_1\colon C\beta \ne 0$. The formulation $H_0\colon C\beta = 0$ includes as special cases the hypotheses in Sections 8.1 and 8.2. The hypothesis $H_0\colon \beta_1 = 0$ in Section 8.1 can be expressed in the form $H_0\colon C\beta = 0$ as follows:


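$$H_0\colon C\beta = (0, I_k)\begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix} = \beta_1 = 0 \qquad [\text{by (2.36)}],$$

where $0$ is a $k \times 1$ vector. Similarly, the hypothesis $H_0\colon \beta_2 = 0$ in Section 8.2 can be expressed in the form $H_0\colon C\beta = 0$:

$$H_0\colon C\beta = (O, I_h)\begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} = \beta_2 = 0,$$

where the matrix $O$ is $h \times (k-h+1)$ and the vector $0$ is $h \times 1$.

The formulation $H_0\colon C\beta = 0$ also allows for more general hypotheses such as

$$H_0\colon 2\beta_1 - \beta_2 = \beta_2 - 2\beta_3 + 3\beta_4 = \beta_1 - \beta_4 = 0,$$

which can be expressed as follows:

$$H_0\colon \begin{pmatrix} 0 & 2 & -1 & 0 & 0 \\ 0 & 0 & 1 & -2 & 3 \\ 0 & 1 & 0 & 0 & -1 \end{pmatrix}\begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \\ \beta_4 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}.$$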




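As another illustration, the hypothesis $H_0\colon \beta_1 = \beta_2 = \beta_3 = \beta_4$ can be expressed in terms of three differences, $H_0\colon \beta_1 - \beta_2 = \beta_2 - \beta_3 = \beta_3 - \beta_4 = 0$, or, equivalently, as $H_0\colon C\beta = 0$:

$$H_0\colon \begin{pmatrix} 0 & 1 & -1 & 0 & 0 \\ 0 & 0 & 1 & -1 & 0 \\ 0 & 0 & 0 & 1 & -1 \end{pmatrix}\begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \\ \beta_4 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}.$$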


In the following theorem, we give the sums of squares used in the test of $H_0\colon C\beta = 0$ versus $H_1\colon C\beta \ne 0$, along with the properties of these sums of squares. We denote the sum of squares due to $C\beta$ (due to the hypothesis) as SSH.

Theorem 8.4a. If $y$ is distributed $N_n(X\beta, \sigma^2 I)$ and $C$ is $q \times (k+1)$ of rank $q \le k+1$, then

(i) $C\hat\beta$ is $N_q[C\beta, \sigma^2 C(X'X)^{-1}C']$.
(ii) $\mathrm{SSH}/\sigma^2 = (C\hat\beta)'[C(X'X)^{-1}C']^{-1}C\hat\beta/\sigma^2$ is $\chi^2(q, \lambda)$, where $\lambda = (C\beta)'[C(X'X)^{-1}C']^{-1}C\beta/2\sigma^2$.
(iii) $\mathrm{SSE}/\sigma^2 = y'[I - X(X'X)^{-1}X']y/\sigma^2$ is $\chi^2(n-k-1)$.
(iv) SSH and SSE are independent.

PROOF

(i) By Theorem 7.6b(i), $\hat\beta$ is $N_{k+1}[\beta, \sigma^2(X'X)^{-1}]$. The result then follows by Theorem 4.4a(ii).
(ii) Since $\mathrm{cov}(C\hat\beta) = \sigma^2 C(X'X)^{-1}C'$ and $\sigma^2[C(X'X)^{-1}C']^{-1}C(X'X)^{-1}C'/\sigma^2 = I$, the result follows by Theorem 5.5.
(iii) This was established in Theorem 8.1b(i).
(iv) Since $\hat\beta$ and SSE are independent [see Theorem 7.6b(iii)], $\mathrm{SSH} = \hat\beta'C'[C(X'X)^{-1}C']^{-1}C\hat\beta$ and SSE are also independent (Seber 1977, pp. 17, 33-34). For a more formal proof, see Problem 8.16.


The F test for $H_0\colon C\beta = 0$ versus $H_1\colon C\beta \ne 0$ is given in the following theorem.

Theorem 8.4b. Let $y$ be $N_n(X\beta, \sigma^2 I)$ and define the statistic

$$F = \frac{\mathrm{SSH}/q}{\mathrm{SSE}/(n-k-1)} = \frac{(C\hat\beta)'[C(X'X)^{-1}C']^{-1}C\hat\beta/q}{\mathrm{SSE}/(n-k-1)}, \tag{8.27}$$

where $C$ is $q \times (k+1)$ of rank $q \le k+1$ and $\hat\beta = (X'X)^{-1}X'y$. The distribution of $F$ in (8.27) is as follows:


(i) If $H_0\colon C\beta = 0$ is false, then $F$ is distributed as $F(q, n-k-1, \lambda)$, where $\lambda = (C\beta)'[C(X'X)^{-1}C']^{-1}C\beta/2\sigma^2$.
(ii) If $H_0\colon C\beta = 0$ is true, then $F$ is distributed as $F(q, n-k-1)$.


PROOF 


(i) This result follows from (5.30) and Theorem 8.4a. 
(ii) This result follows from (5.28) and Theorem 8.4a. 


The F test for $H_0\colon C\beta = 0$ in Theorem 8.4b is usually called the general linear hypothesis test. The degrees of freedom $q$ is the number of linear combinations in $C\beta$. The test for $H_0\colon C\beta = 0$ is carried out as follows. Reject $H_0$ if $F \ge F_{\alpha,q,n-k-1}$, where $F$ is as given in (8.27) and $F_{\alpha,q,n-k-1}$ is the upper $\alpha$ percentage point of the (central) F distribution. Alternatively, we can reject $H_0$ if $p \le \alpha$, where $p$ is the p value for $F$. [The p value is the probability that $F(q, n-k-1)$ exceeds the observed F value.] Since $C(X'X)^{-1}C'$ is positive definite (see Problem 8.17), $\lambda > 0$ if $H_0$ is false, where $\lambda = (C\beta)'[C(X'X)^{-1}C']^{-1}C\beta/2\sigma^2$. Hence we reject $H_0\colon C\beta = 0$ for large values of $F$.

In Theorems 8.4a and 8.4b, SSH could be written as $(C\hat\beta - 0)'[C(X'X)^{-1}C']^{-1}(C\hat\beta - 0)$, which is the squared distance between $C\hat\beta$ and the hypothesized value of $C\beta$. The distance is standardized by the covariance matrix of $C\hat\beta$. Intuitively, if $H_0$ is true, $C\hat\beta$ tends to be close to $0$ so that the numerator of $F$ in (8.27) is small. On the other hand, if $C\beta$ is very different from $0$, the numerator of $F$ tends to be large.

The expected mean squares for the F test are given by

$$E\left(\frac{\mathrm{SSH}}{q}\right) = \sigma^2 + \frac{1}{q}(C\beta)'[C(X'X)^{-1}C']^{-1}C\beta, \tag{8.28}$$

$$E\left(\frac{\mathrm{SSE}}{n-k-1}\right) = \sigma^2.$$

These expected mean squares provide additional motivation for rejecting $H_0$ for large values of $F$. If $H_0$ is true, both expected mean squares are equal to $\sigma^2$; if $H_0$ is false, $E(\mathrm{SSH}/q) > E[\mathrm{SSE}/(n-k-1)]$.

The F statistic in (8.27) is invariant to full-rank linear transformations on the x's or on $y$.

Theorem 8.4c. Let $z = cy$ and $W = XK$, where $K$ is nonsingular (see Corollary 1 to Theorem 7.3e for the form of $K$). The F statistic in (8.27) is unchanged by these transformations.




PROOF. See Problem 8.18.


In the first paragraph of this section, it was noted that the hypothesis $H_0\colon \beta_2 = 0$ can be expressed in the form $H_0\colon C\beta = 0$. Since we used a full-reduced-model approach to develop the test for $H_0\colon \beta_2 = 0$, we expect that the general linear hypothesis test is also a full-reduced-model test. This is confirmed in the following theorem.


Theorem 8.4d. The F test in Theorem 8.4b for the general linear hypothesis $H_0\colon C\beta = 0$ is a full-reduced-model test.

PROOF. The reduced model under $H_0$ is

$$y = X\beta + \varepsilon \quad\text{subject to}\quad C\beta = 0. \tag{8.29}$$

Using Lagrange multipliers (Section 2.14.3), it can be shown (see Problem 8.19) that the estimator for $\beta$ in this reduced model is

$$\hat\beta_c = \hat\beta - (X'X)^{-1}C'[C(X'X)^{-1}C']^{-1}C\hat\beta, \tag{8.30}$$

where $\hat\beta = (X'X)^{-1}X'y$ is estimated from the full model unrestricted by the hypothesis and the subscript $c$ in $\hat\beta_c$ indicates that $\beta$ is estimated subject to the constraint $C\beta = 0$. In (8.29), the $X$ matrix for the reduced model is unchanged from the full model, and the regression sum of squares for the reduced model is therefore $\hat\beta_c'X'y$ (for a more formal justification of $\hat\beta_c'X'y$, see Problem 8.20). Hence, the regression sum of squares due to the hypothesis is

$$\mathrm{SSH} = \hat\beta'X'y - \hat\beta_c'X'y. \tag{8.31}$$

By substituting $\hat\beta_c$ [as given by (8.30)] into (8.31), we obtain

$$\mathrm{SSH} = (C\hat\beta)'[C(X'X)^{-1}C']^{-1}C\hat\beta \tag{8.32}$$

(see Problem 8.21), thus establishing that the F test in Theorem 8.4b for $H_0\colon C\beta = 0$ is a full-reduced-model test.
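The equality of (8.31) and (8.32) can be illustrated numerically. The sketch below uses simulated data and the single constraint $\beta_1 = 2\beta_2$ (so $q = 1$); the data, the seed, and the variable names are assumptions for illustration. SSH is computed three ways: from the quadratic form (8.32), from the constrained estimator (8.30) via (8.31), and by refitting a reduced model with the constraint substituted into the model.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([2.0, 1.0, 0.5, -1.0]) + rng.normal(size=n)

C = np.array([[0.0, 1.0, -2.0, 0.0]])        # H0: beta_1 - 2*beta_2 = 0
q = C.shape[0]

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
M = C @ XtX_inv @ C.T                        # C (X'X)^{-1} C'

# SSH from the quadratic form (8.32)
Cb = C @ beta_hat
ssh_quadratic = Cb @ np.linalg.solve(M, Cb)

# Constrained estimator (8.30) and SSH as a difference of regression SS (8.31)
beta_c = beta_hat - XtX_inv @ C.T @ np.linalg.solve(M, Cb)
ssh_difference = beta_hat @ X.T @ y - beta_c @ X.T @ y

# Reduced model with the constraint substituted: beta_0 + beta_2 (2 x1 + x2) + beta_3 x3
X_red = np.column_stack([X[:, 0], 2 * X[:, 1] + X[:, 2], X[:, 3]])
beta_red = np.linalg.lstsq(X_red, y, rcond=None)[0]
ssh_refit = beta_hat @ X.T @ y - beta_red @ X_red.T @ y

sse = y @ y - beta_hat @ X.T @ y
F = (ssh_quadratic / q) / (sse / (n - k - 1))    # (8.27)
print(ssh_quadratic, ssh_difference, ssh_refit, F)
```

The constrained estimate also satisfies $C\hat\beta_c = 0$ up to rounding, which is a quick sanity check on (8.30).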


Example 8.4.1a. In many cases, the hypothesis can be incorporated directly into the model to obtain the reduced model. Suppose that the full model is

$$y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \beta_3x_{i3} + \varepsilon_i$$

and the hypothesis is $H_0\colon \beta_1 = 2\beta_2$. Then the reduced model becomes

$$\begin{aligned}
y_i &= \beta_0 + 2\beta_2x_{i1} + \beta_2x_{i2} + \beta_3x_{i3} + \varepsilon_i \\
&= \beta_{c0} + \beta_{c2}(2x_{i1} + x_{i2}) + \beta_{c3}x_{i3} + \varepsilon_i,
\end{aligned}$$

where $\beta_{ci}$ indicates a parameter subject to the constraint $\beta_1 = 2\beta_2$. The full model and reduced model could be fit, and the difference $\mathrm{SS}(\beta_2|\beta_1) = \hat\beta'X'y - \hat\beta_1^{*\prime}X_1'y$ would be the same as SSH in (8.32).

If $C\beta \ne 0$, the estimator $\hat\beta_c$ in (8.30) is a biased estimator of $\beta$, but the variances of the $\hat\beta_{cj}$'s in $\hat\beta_c$ are reduced, as shown in the following theorem.


Theorem 8.4e. The mean vector and covariance matrix of $\hat\beta_c$ in (8.30) are as follows:

(i) $E(\hat\beta_c) = \beta - (X'X)^{-1}C'[C(X'X)^{-1}C']^{-1}C\beta$. (8.33)
(ii) $\mathrm{cov}(\hat\beta_c) = \sigma^2(X'X)^{-1} - \sigma^2(X'X)^{-1}C'[C(X'X)^{-1}C']^{-1}C(X'X)^{-1}$. (8.34)

PROOF. See Problem 8.22.

Since the second matrix on the right side of (8.34) is positive semidefinite, the diagonal elements of $\mathrm{cov}(\hat\beta_c)$ are no larger than those of $\mathrm{cov}(\hat\beta) = \sigma^2(X'X)^{-1}$; that is, $\mathrm{var}(\hat\beta_{cj}) \le \mathrm{var}(\hat\beta_j)$ for $j = 0, 1, 2, \ldots, k$, where $\mathrm{var}(\hat\beta_{cj})$ is the $j$th diagonal element of $\mathrm{cov}(\hat\beta_c)$ in (8.34). This is analogous to the inequality $\mathrm{var}(\hat\beta_j^*) \le \mathrm{var}(\hat\beta_j)$ in Theorem 7.9c, where $\hat\beta_j^*$ is from the reduced model.


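Example 8.4.1b. Consider the dependent variable $y_1$ in the chemical reaction data in Table 7.4. For the model $y_1 = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_3 + \varepsilon$, we test $H_0\colon 2\beta_1 = 2\beta_2 = \beta_3$ using (8.27) in Theorem 8.4b. To express $H_0$ in the form $C\beta = 0$, the matrix $C$ becomes

$$C = \begin{pmatrix} 0 & 1 & -1 & 0 \\ 0 & 0 & 2 & -1 \end{pmatrix},$$

and we obtain

$$C\hat\beta = \begin{pmatrix} -.1214 \\ -.6118 \end{pmatrix}, \qquad C(X'X)^{-1}C' = \begin{pmatrix} .003366 & -.006943 \\ -.006943 & .044974 \end{pmatrix},$$

$$F = \frac{(C\hat\beta)'[C(X'X)^{-1}C']^{-1}C\hat\beta/2}{s^2} = \frac{28.62301/2}{5.3449} = 2.6776,$$

which has $p = .101$.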
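The final step of Example 8.4.1b can be checked from the two numbers that appear in the ratio. The sketch below reads 28.62301 as the quadratic form SSH and 5.3449 as the error mean square; the error degrees of freedom 15 is an assumption (it follows if the chemical reaction data have $n = 19$ observations, as implied by Example 8.2a, with $k = 3$). For numerator degrees of freedom 2, the upper tail probability of the F distribution has a simple closed form, so no statistical library is needed.

```python
ssh = 28.62301     # quadratic form (C beta_hat)' [C (X'X)^{-1} C']^{-1} (C beta_hat)
q = 2              # number of rows of C
s2 = 5.3449        # error mean square, SSE / (n - k - 1)
df_error = 15      # assumed: n = 19 observations and k = 3 predictors

F = (ssh / q) / s2                                   # (8.27)

# Closed form valid only for numerator df = 2:
# P(F_{2,v} > f) = (1 + 2 f / v) ** (-v / 2)
p_value = (1 + 2 * F / df_error) ** (-df_error / 2)

print(round(F, 4), round(p_value, 3))
```

Under these assumptions the computed tail probability agrees with the p value reported in the example to three decimals.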




8.4.2 The Test for $H_0\colon C\beta = t$


The test for $H_0\colon C\beta = t$ is a straightforward extension of the test for $H_0\colon C\beta = 0$. With the additional flexibility provided by $t$, we can test hypotheses such as $H_0\colon \beta_2 = \beta_1 + 5$. We assume that the system of equations $C\beta = t$ is consistent, that is, that $\mathrm{rank}(C) = \mathrm{rank}(C, t)$ (see Theorem 2.7). The requisite sums of squares and their properties are given in the following theorem, which is analogous to Theorem 8.4a.

Theorem 8.4f. If $y$ is $N_n(X\beta, \sigma^2 I)$ and $C$ is $q \times (k+1)$ of rank $q \le k+1$, then

(i) $C\hat\beta - t$ is $N_q[C\beta - t, \sigma^2 C(X'X)^{-1}C']$.
(ii) $\mathrm{SSH}/\sigma^2 = (C\hat\beta - t)'[C(X'X)^{-1}C']^{-1}(C\hat\beta - t)/\sigma^2$ is $\chi^2(q, \lambda)$, where $\lambda = (C\beta - t)'[C(X'X)^{-1}C']^{-1}(C\beta - t)/2\sigma^2$.
(iii) $\mathrm{SSE}/\sigma^2 = y'[I - X(X'X)^{-1}X']y/\sigma^2$ is $\chi^2(n-k-1)$.
(iv) SSH and SSE are independent.

PROOF

(i) By Theorem 7.6b(i), $\hat\beta$ is $N_{k+1}[\beta, \sigma^2(X'X)^{-1}]$. The result follows by Corollary 1 to Theorem 4.4a.
(ii) By part (i), $\mathrm{cov}(C\hat\beta - t) = \sigma^2 C(X'X)^{-1}C'$. The result follows as in the proof of Theorem 8.4a(ii).
(iii) See Theorem 8.1b(ii).
(iv) Since $\hat\beta$ and SSE are independent [see Theorem 7.6b(iii)], SSH and SSE are independent [see Seber (1977, pp. 17, 33-34)]. For a more formal proof, see Problem 8.23.


An F test for $H_0\colon C\beta = t$ versus $H_1\colon C\beta \ne t$ is given in the following theorem, which is analogous to Theorem 8.4b.

Theorem 8.4g. Let $y$ be $N_n(X\beta, \sigma^2 I)$ and define an F statistic as follows:

$$F = \frac{\mathrm{SSH}/q}{\mathrm{SSE}/(n-k-1)} = \frac{(C\hat\beta - t)'[C(X'X)^{-1}C']^{-1}(C\hat\beta - t)/q}{\mathrm{SSE}/(n-k-1)}, \tag{8.35}$$

where $\hat\beta = (X'X)^{-1}X'y$. The distribution of $F$ in (8.35) is as follows:

(i) If $H_0\colon C\beta = t$ is false, then $F$ is distributed as $F(q, n-k-1, \lambda)$, where $\lambda = (C\beta - t)'[C(X'X)^{-1}C']^{-1}(C\beta - t)/2\sigma^2$.




(ii) If $H_0\colon C\beta = t$ is true, then $\lambda = 0$ and $F$ is distributed as $F(q, n-k-1)$.

PROOF

(i) This result follows from (5.30) and Theorem 8.4f.
(ii) This result follows from (5.28) and Theorem 8.4f.


The test for $H_0\colon C\beta = t$ is carried out as follows. Reject $H_0$ if $F \ge F_{\alpha,q,n-k-1}$, where $F_{\alpha,q,n-k-1}$ is the upper $\alpha$ percentage point of the central F distribution. Alternatively, we can reject $H_0$ if $p \le \alpha$, where $p$ is the p value for $F$.

The expected mean squares for the F test are given by

$$E\left(\frac{\mathrm{SSH}}{q}\right) = \sigma^2 + \frac{1}{q}(C\beta - t)'[C(X'X)^{-1}C']^{-1}(C\beta - t), \tag{8.36}$$

$$E\left(\frac{\mathrm{SSE}}{n-k-1}\right) = \sigma^2.$$

By extension of Theorem 8.4d, the F test for $H_0\colon C\beta = t$ in Theorem 8.4g is a full-reduced-model test (see Problem 8.24 for a partial result).


8.5 TESTS ON $\beta_j$ AND $a'\beta$


We consider tests for a single $\beta_j$ or a single linear combination $a'\beta$ in Section 8.5.1 and tests for several $\beta_j$'s or several $a_i'\beta$'s in Section 8.5.2.


8.5.1 Testing One $\beta_j$ or One $a'\beta$


Tests for an individual $\beta_j$ can be obtained using either the full-reduced-model approach in Section 8.2 or the general linear hypothesis approach in Section 8.4. The test statistic for $H_0\colon \beta_k = 0$ using a full-reduced-model approach is given in (8.20) as

$$F = \frac{\hat\beta'X'y - \hat\beta_1^{*\prime}X_1'y}{(y'y - \hat\beta'X'y)/(n-k-1)}, \tag{8.37}$$

which is distributed as $F(1, n-k-1)$ if $H_0$ is true. In this case, $\beta_k$ is the last $\beta$, so that $\beta$ is partitioned as $\beta = \begin{pmatrix} \beta_1 \\ \beta_k \end{pmatrix}$ and $X$ is partitioned as $X = (X_1, x_k)$, where $x_k$ is the last column of $X$. Then $X_1$ in the reduced model $y = X_1\beta_1^* + \varepsilon^*$ contains all the columns of $X$ except the last.

To test $H_0\colon \beta_j = 0$ by means of the general linear hypothesis test of $H_0\colon C\beta = 0$ (Section 8.4.1), we first consider a test of $H_0\colon a'\beta = 0$ for a single linear combination, for example, $a'\beta = (0, 2, -2, 3, 1)\beta$. Using $a'$ in place of the matrix $C$ in $C\beta = 0$, we have $q = 1$, and (8.27) becomes

$$F = \frac{(a'\hat\beta)'[a'(X'X)^{-1}a]^{-1}a'\hat\beta}{\mathrm{SSE}/(n-k-1)} = \frac{(a'\hat\beta)^2}{s^2a'(X'X)^{-1}a}, \tag{8.38}$$

where $s^2 = \mathrm{SSE}/(n-k-1)$. The F statistic in (8.38) is distributed as $F(1, n-k-1)$ if $H_0\colon a'\beta = 0$ is true.

To test $H_0\colon \beta_j = 0$ using (8.38), we define $a' = (0, \ldots, 0, 1, 0, \ldots, 0)$, where the 1 is in the $j$th position. This gives

$$F = \frac{\hat\beta_j^2}{s^2g_{jj}}, \tag{8.39}$$

where $g_{jj}$ is the $j$th diagonal element of $(X'X)^{-1}$. If $H_0\colon \beta_j = 0$ is true, $F$ in (8.39) is distributed as $F(1, n-k-1)$. We reject $H_0\colon \beta_j = 0$ if $F \ge F_{\alpha,1,n-k-1}$ or, equivalently, if $p \le \alpha$, where $p$ is the p value for $F$.

By Theorem 8.4d (see also Problem 8.25), the F statistics in (8.37) and (8.39) are the same (for $j = k$). This confirms that (8.39) tests $H_0\colon \beta_j = 0$ adjusted for the other $\beta$'s.

Since the F statistic in (8.39) has 1 and $n-k-1$ degrees of freedom, we can equivalently use the t statistic

$$t_j = \frac{\hat\beta_j}{s\sqrt{g_{jj}}} \tag{8.40}$$

to test the effect of $\beta_j$ above and beyond the other $\beta$'s (see Problem 5.16). We reject $H_0\colon \beta_j = 0$ if $|t_j| \ge t_{\alpha/2,n-k-1}$ or, equivalently, if $p \le \alpha$, where $p$ is the p value. For a two-tailed t test such as this one, the p value is twice the probability that $t(n-k-1)$ exceeds the absolute value of the observed $t$.
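The agreement between the full-reduced-model statistic (8.37) and the square of the t statistic (8.40) for the last coefficient can be seen on simulated data. The sketch below is illustrative only; the design, the coefficients, and the seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 30, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 0.8, 0.0, -0.5, 0.3]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
sse = y @ y - beta_hat @ X.T @ y
s2 = sse / (n - k - 1)

# t statistics (8.40) for every coefficient, j = 0, 1, ..., k
t = beta_hat / np.sqrt(s2 * np.diag(XtX_inv))

# Full-reduced-model F (8.37) for the last coefficient: drop the last column of X
X1 = X[:, :-1]
beta_red = np.linalg.lstsq(X1, y, rcond=None)[0]
F_last = (beta_hat @ X.T @ y - beta_red @ X1.T @ y) / s2

print(t[-1] ** 2, F_last)
```

The t statistics for all coefficients come from a single fit of the full model, whereas the full-reduced-model route would require one additional fit per coefficient.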

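For $j = 1$, (8.40) becomes $t = \hat\beta_1/(s\sqrt{g_{11}})$, which is not the same as $t = \hat\beta_1\big/\left[s\big/\sqrt{\sum_{i=1}^n (x_i - \bar x)^2}\right]$ in (6.14). Unless the x's are orthogonal, $g_{11}^{-1} \ne \sum_{i=1}^n (x_{i1} - \bar x_1)^2$.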


8.5.2 Testing Several $\beta_j$'s or $a_i'\beta$'s


We sometimes want to carry out several separate tests rather than a single joint test of the hypotheses. For example, the test in (8.40) might be carried out separately for each $\beta_j$, $j = 1, \ldots, k$, rather than the joint test of $H_0\colon \beta_1 = 0$ in (8.5). Similarly, we might want to carry out separate tests for several (say, $d$) $a_i'\beta$'s using (8.38) rather than the joint test of $H_0\colon C\beta = 0$ using (8.27), where

$$C = \begin{pmatrix} a_1' \\ a_2' \\ \vdots \\ a_d' \end{pmatrix}.$$


In such situations there are two different $\alpha$ levels, the overall or familywise $\alpha$ level ($\alpha_f$) and the $\alpha$ level for each test or comparisonwise $\alpha$ level ($\alpha_c$). In some cases researchers desire to control $\alpha_c$ when doing several tests (Saville 1990), and so no changes are needed in the testing procedure. In other cases, the desire is to control $\alpha_f$. In yet other cases, especially those involving thousands of separate tests (e.g., microarray data), it makes sense to control other quantities such as the false discovery rate (Benjamini and Hochberg 1995, Benjamini and Yekutieli 2001). This will not be discussed further here. We consider two ways to control $\alpha_f$ when several tests are made.

The first of these methods is the Bonferroni approach (Bonferroni 1936), which reduces $\alpha_c$ for each test, so that $\alpha_f$ is less than the desired level of $\alpha^*$. As an example, suppose that we carry out the $k$ tests of $H_{0j}\colon \beta_j = 0$, $j = 1, 2, \ldots, k$. Let $E_j$ be the event that the $j$th test rejects $H_{0j}$ when it is true, where $P(E_j) = \alpha_c$. The overall $\alpha_f$ can be defined as

$$\begin{aligned}
\alpha_f &= P(\text{reject at least one } H_{0j} \text{ when all } H_{0j} \text{ are true}) \\
&= P(E_1 \text{ or } E_2 \ldots \text{ or } E_k).
\end{aligned}$$

Expressing this more formally and applying the Bonferroni inequality, we obtain

$$\alpha_f = P(E_1 \cup E_2 \cup \cdots \cup E_k) \le \sum_{j=1}^k P(E_j) = \sum_{j=1}^k \alpha_c = k\alpha_c. \tag{8.41}$$

We can thus ensure that $\alpha_f$ is less than or equal to the desired $\alpha^*$ by simply setting $\alpha_c = \alpha^*/k$. Since $\alpha_f$ in (8.41) is at most $\alpha^*$, the Bonferroni procedure is a conservative approach.

To test $H_{0j}\colon \beta_j = 0$, $j = 1, 2, \ldots, k$, with $\alpha_f \le \alpha^*$, we use (8.40)

$$t_j = \frac{\hat\beta_j}{s\sqrt{g_{jj}}}, \tag{8.42}$$

and reject $H_{0j}$ if $|t_j| \ge t_{\alpha^*/2k,n-k-1}$. Bonferroni critical values $t_{\alpha^*/2k,\nu}$ are available in Bailey (1977). See also Rencher (2002, pp. 562-565). The critical values $t_{\alpha^*/2k,\nu}$ can also be found using many software packages. Alternatively, we can carry out the test by the use of p values and reject $H_{0j}$ if $p \le \alpha^*/k$.


More generally, to test $H_{0i}\colon a_i'\beta = 0$ for $i = 1, 2, \ldots, d$ with $\alpha_f \le \alpha^*$, we use (8.38)

$$F_i = \frac{(a_i'\hat\beta)'[a_i'(X'X)^{-1}a_i]^{-1}a_i'\hat\beta}{s^2}, \tag{8.43}$$

and reject $H_{0i}$ if $F_i \ge F_{\alpha^*/d,1,n-k-1}$. The critical values $F_{\alpha^*/d}$ are available in many software packages. To use p values, reject $H_{0i}$ if $p \le \alpha^*/d$.

The above Bonferroni procedures do not require independence of the $\hat\beta_j$'s; they are valid for any covariance structure on the $\hat\beta_j$'s. However, the logic of the Bonferroni procedure for testing $H_{0i}\colon a_i'\beta = 0$ for $i = 1, 2, \ldots, d$ requires that the coefficient vectors $a_1, a_2, \ldots, a_d$ be specified before seeing the data. If we wish to choose values of $a_i$ after looking at the data, we must use the Scheffé procedure described below. Modifications of the Bonferroni approach have been proposed that are less conservative but still control $\alpha_f$. For examples of these modified procedures, see Holm (1979), Shaffer (1986), Simes (1986), Holland and Copenhaver (1987), Hochberg (1988), Hommel (1988), Rom (1990), and Rencher (1995, Section 3.4.4). Comparisons of these procedures have been made by Holland (1991) and Broadbent (1993).

A second approach to controlling $\alpha_f$, due to Scheffé (1953; 1959, p. 68), yields simultaneous tests of $H_0\colon a'\beta = 0$ for all possible values of $a$, including those chosen after looking at the data. We could also test $H_0\colon a'\beta = t$ for arbitrary $t$. For any given $a$, the hypothesis $H_0\colon a'\beta = 0$ is tested as usual by (8.38)

$$F = \frac{(a'\hat\beta)'[a'(X'X)^{-1}a]^{-1}a'\hat\beta}{s^2} = \frac{(a'\hat\beta)^2}{s^2a'(X'X)^{-1}a}, \tag{8.44}$$

but the test proceeds by finding a critical value large enough to hold for all possible $a$. Accordingly, we now find the distribution of $\max_a F$.

Theorem 8.5

(i) The maximum value of $F$ in (8.44) is given by

$$\max_a \frac{(a'\hat\beta)^2}{s^2a'(X'X)^{-1}a} = \frac{\hat\beta'X'X\hat\beta}{s^2}. \tag{8.45}$$

(ii) If $y$ is $N_n(X\beta, \sigma^2 I)$, then $\hat\beta'X'X\hat\beta/(k+1)s^2$ is distributed as $F(k+1, n-k-1)$. Thus

$$\max_a \frac{(a'\hat\beta)^2}{s^2a'(X'X)^{-1}a\,(k+1)}$$

is distributed as $F(k+1, n-k-1)$.


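PROOF

(i) Using the quotient rule, chain rule, and Section 2.14.1, we differentiate $(a'\hat\beta)^2/a'(X'X)^{-1}a$ with respect to $a$ and set the result equal to $0$:

$$\frac{\partial}{\partial a}\,\frac{(a'\hat\beta)^2}{a'(X'X)^{-1}a} = \frac{[a'(X'X)^{-1}a]\,2(a'\hat\beta)\hat\beta - (a'\hat\beta)^2\,2(X'X)^{-1}a}{[a'(X'X)^{-1}a]^2} = 0.$$

Multiplying by $[a'(X'X)^{-1}a]^2/2a'\hat\beta$ and treating $1 \times 1$ matrices as scalars, we obtain

$$[a'(X'X)^{-1}a]\hat\beta - a'\hat\beta(X'X)^{-1}a = 0,$$

$$a = \frac{a'(X'X)^{-1}a}{a'\hat\beta}\,X'X\hat\beta = cX'X\hat\beta,$$

where $c = a'(X'X)^{-1}a/a'\hat\beta$. Substituting $a = cX'X\hat\beta$ into (8.44) gives

$$\max_a \frac{(a'\hat\beta)^2}{s^2a'(X'X)^{-1}a} = \frac{(c\hat\beta'X'X\hat\beta)^2}{s^2c\hat\beta'X'X(X'X)^{-1}cX'X\hat\beta} = \frac{(\hat\beta'X'X\hat\beta)^2}{s^2\hat\beta'X'X\hat\beta} = \frac{\hat\beta'X'X\hat\beta}{s^2}.$$

(ii) Using $C = I_{k+1}$ in (8.27), we have, by Theorem 8.4b(ii), that

$$F = \frac{\hat\beta'X'X\hat\beta}{(k+1)s^2}$$

is distributed as $F(k+1, n-k-1)$.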


By Theorem 8.5(ii), we have

$$P\left[\max_a \frac{(a'\hat\beta)^2}{s^2a'(X'X)^{-1}a\,(k+1)} \ge F_{\alpha^*,k+1,n-k-1}\right] = \alpha^*,$$

$$P\left[\max_a \frac{(a'\hat\beta)^2}{s^2a'(X'X)^{-1}a} \ge (k+1)F_{\alpha^*,k+1,n-k-1}\right] = \alpha^*.$$

Thus, to test $H_0\colon a'\beta = 0$ for any and all $a$ (including values of $a$ chosen after seeing the data) with $\alpha_f \le \alpha^*$, we calculate $F$ in (8.44) and reject $H_0$ if $F \ge (k+1)F_{\alpha^*,k+1,n-k-1}$.
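Part (i) of Theorem 8.5 lends itself to a numerical check. In the following sketch (simulated data; all names and sizes are assumptions), the statistic (8.44) is evaluated at the maximizing direction $a = X'X\hat\beta$ and at many random directions, and compared with the bound $\hat\beta'X'X\hat\beta/s^2$ in (8.45).

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 0.5, -0.5, 0.2]) + rng.normal(size=n)

XtX = X.T @ X
XtX_inv = np.linalg.inv(XtX)
beta_hat = XtX_inv @ X.T @ y
s2 = (y @ y - beta_hat @ X.T @ y) / (n - k - 1)

def F_of(a):
    """F statistic (8.44) for the linear combination a'beta."""
    return (a @ beta_hat) ** 2 / (s2 * (a @ XtX_inv @ a))

F_max = beta_hat @ XtX @ beta_hat / s2          # right side of (8.45)
a_star = XtX @ beta_hat                         # maximizer, up to the scalar c
F_random = np.array([F_of(a) for a in rng.normal(size=(2000, k + 1))])

print(F_of(a_star), F_max, F_random.max())
```

The statistic is invariant to rescaling of $a$, which is why the scalar $c$ in the proof does not need to be computed.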

To test for individual $\beta_j$'s using Scheffé's procedure, we set $a' = (0, \ldots, 0, 1, 0, \ldots, 0)$ with a 1 in the $j$th position. Then $F$ in (8.44) reduces to $F = \hat\beta_j^2/s^2g_{jj}$ in (8.39), and the square root is $t_j = \hat\beta_j/s\sqrt{g_{jj}}$ in (8.42). By Theorem 8.5, we reject $H_0\colon a'\beta = \beta_j = 0$ if $|t_j| \ge \sqrt{(k+1)F_{\alpha^*,k+1,n-k-1}}$.

For practical purposes [$k \le (n-3)$], we have

$$t_{\alpha^*/2k,\,n-k-1} < \sqrt{(k+1)F_{\alpha^*,k+1,n-k-1}},$$

and thus the Bonferroni tests for individual $\beta_j$'s in (8.42) are usually more powerful than the Scheffé tests. On the other hand, for a large number of linear combinations $a'\beta$, the Scheffé test is better since $(k+1)F_{\alpha^*,k+1,n-k-1}$ is constant, while the critical value $F_{\alpha^*/d,1,n-k-1}$ for Bonferroni tests in (8.43) increases with the number of tests $d$ and eventually exceeds the critical value for Scheffé tests.

It has been assumed that the tests in this section for $H_0\colon \beta_j = 0$ are carried out without regard to whether the overall hypothesis $H_0\colon \beta_1 = 0$ is rejected. However, if the test statistics $t_j = \hat\beta_j/s\sqrt{g_{jj}}$, $j = 1, 2, \ldots, k$, in (8.42) are calculated only if $H_0\colon \beta_1 = 0$ is rejected using $F$ in (8.5), then clearly $\alpha_f$ is reduced and the conservative critical values $t_{\alpha^*/2k,n-k-1}$ and $\sqrt{(k+1)F_{\alpha^*,k+1,n-k-1}}$ become even more conservative. Using this protected testing principle (Hocking 1996, p. 106), we can even use the critical value $t_{\alpha^*/2,n-k-1}$ for all $k$ tests and $\alpha_f$ will still be close to $\alpha^*$. [For illustrations of this familywise error rate structure, see Hummel and Sligo (1971) and Rencher and Scott (1990).] A similar statement can be made for testing the overall hypothesis $H_0\colon C\beta = 0$ followed by t tests or F tests of $H_0\colon c_i'\beta = 0$ using the rows of $C$.


Example 8.5.2. We test $H_{01}\colon \beta_1 = 0$ and $H_{02}\colon \beta_2 = 0$ for the data in Table 7.1. Using (8.42) and the results in Examples 7.3.1(a), 7.3.3, and 8.1, we have

$$t_1 = \frac{\hat\beta_1}{s\sqrt{g_{11}}} = \frac{3.0118}{\sqrt{2.8288}\sqrt{.16207}} = \frac{3.0118}{.67709} = 4.448,$$

$$t_2 = \frac{\hat\beta_2}{s\sqrt{g_{22}}} = \frac{1.2855}{\sqrt{2.8288}\sqrt{.08360}} = \frac{1.2855}{.48629} = 2.643.$$

Using $\alpha = .05$ for each test, we reject both $H_{01}$ and $H_{02}$ because $t_{.025,9} = 2.262$. The (two-sided) p values are .00160 and .0268, respectively. If we use $\alpha = .05/2 = .025$ for a Bonferroni test, we would not reject $H_{02}$ since $p = .0268 > .025$. However, using the protected testing principle, we would reject $H_{02}$ because the overall regression hypothesis $H_0\colon \beta_1 = 0$ was rejected in Example 8.1.
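The computations in Example 8.5.2 are easy to reproduce. The sketch below takes the estimates, $s^2$, the diagonal elements $g_{11}$ and $g_{22}$, and the p values as quoted in the example and applies the unadjusted and Bonferroni decision rules; the dictionary layout is simply a convenience chosen here.

```python
import math

s2 = 2.8288
beta = {1: 3.0118, 2: 1.2855}          # estimates of beta_1 and beta_2
g = {1: 0.16207, 2: 0.08360}           # diagonal elements g_11, g_22 of (X'X)^{-1}
p_values = {1: 0.00160, 2: 0.0268}     # two-sided p values quoted in the example
alpha_star, n_tests = 0.05, 2

# t statistics from (8.42)
t = {j: beta[j] / math.sqrt(s2 * g[j]) for j in beta}

# Comparisonwise rule (p <= alpha) versus Bonferroni rule (p <= alpha*/k)
reject_unadjusted = {j: p_values[j] <= alpha_star for j in beta}
reject_bonferroni = {j: p_values[j] <= alpha_star / n_tests for j in beta}

print(t, reject_unadjusted, reject_bonferroni)
```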


8.6 CONFIDENCE INTERVALS AND PREDICTION INTERVALS 


In this section we consider a confidence region for $\beta$, confidence intervals for $\beta_j$, $a'\beta$, $E(y)$, and $\sigma^2$, and prediction intervals for future observations. We assume throughout Section 8.6 that $y$ is $N_n(X\beta, \sigma^2 I)$.


8.6.1 Confidence Region for $\beta$


If $C$ is equal to $I$ and $t$ is equal to $\beta$ in (8.35), $q$ becomes $k+1$, we obtain a central F distribution, and we can make the probability statement

$$P[(\hat\beta - \beta)'X'X(\hat\beta - \beta)/(k+1)s^2 \le F_{\alpha,k+1,n-k-1}] = 1 - \alpha,$$

where $s^2 = \mathrm{SSE}/(n-k-1)$. From this statement, a $100(1-\alpha)\%$ joint confidence region for $\beta_0, \beta_1, \ldots, \beta_k$ in $\beta$ is defined to consist of all vectors $\beta$ that satisfy

$$(\hat\beta - \beta)'X'X(\hat\beta - \beta) \le (k+1)s^2F_{\alpha,k+1,n-k-1}. \tag{8.46}$$

For $k = 1$, this region can be plotted as an ellipse in two dimensions. For $k > 1$, the ellipsoidal region in (8.46) is unwieldy to interpret and report, and we therefore consider intervals for the individual $\beta_j$'s.


8.6.2 Confidence Interval for $\beta_j$


If $\beta_j \ne 0$, we can subtract $\beta_j$ in (8.40) so that $t_j = (\hat\beta_j - \beta_j)/s\sqrt{g_{jj}}$ has the central t distribution, where $g_{jj}$ is the $j$th diagonal element of $(X'X)^{-1}$. Then

$$P\left[-t_{\alpha/2,n-k-1} \le \frac{\hat\beta_j - \beta_j}{s\sqrt{g_{jj}}} \le t_{\alpha/2,n-k-1}\right] = 1 - \alpha.$$

Solving the inequality for $\beta_j$ gives

$$P(\hat\beta_j - t_{\alpha/2,n-k-1}s\sqrt{g_{jj}} \le \beta_j \le \hat\beta_j + t_{\alpha/2,n-k-1}s\sqrt{g_{jj}}) = 1 - \alpha.$$

Before taking the sample, the probability that the random interval will contain $\beta_j$ is $1 - \alpha$. After taking the sample, the $100(1-\alpha)\%$ confidence interval for $\beta_j$

$$\hat\beta_j \pm t_{\alpha/2,n-k-1}s\sqrt{g_{jj}} \tag{8.47}$$

is no longer random, and thus we say that we are $100(1-\alpha)\%$ confident that the interval contains $\beta_j$.

Note that the confidence coefficient $1 - \alpha$ holds only for a single confidence interval for one of the $\beta_j$'s. For confidence intervals for all $k+1$ of the $\beta$'s that hold simultaneously with overall confidence coefficient $1 - \alpha$, see Section 8.6.7.


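Example 8.6.2. We compute a 95% confidence interval for each $\beta_j$ using $y_2$ in the
chemical reaction data in Table 7.4 (see Example 8.2a). The matrix $(\mathbf{X}'\mathbf{X})^{-1}$ (see
the answer to Problem 7.52) and the estimate $\hat{\boldsymbol{\beta}}$ have the following values:

$$(\mathbf{X}'\mathbf{X})^{-1} = \begin{pmatrix}
65.37550 & -0.33885 & -0.31252 & -0.02041 \\
-0.33885 & 0.00184 & 0.00127 & -0.00043 \\
-0.31252 & 0.00127 & 0.00408 & -0.00176 \\
-0.02041 & -0.00043 & -0.00176 & 0.02161
\end{pmatrix}, \qquad
\hat{\boldsymbol{\beta}} = \begin{pmatrix} -26.0353 \\ 0.4046 \\ 0.2930 \\ 1.0338 \end{pmatrix}.$$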


For $\beta_1$, we obtain by (8.47),

$$\hat\beta_1 \pm t_{.025,15}\,s\sqrt{g_{11}},$$
$$.4046 \pm (2.1314)(4.0781)\sqrt{.00184},$$
$$.4046 \pm .3723,$$
$$(.0322,\ .7769).$$

For the other $\beta_j$'s, we have

$$\beta_0\colon\quad -26.0353 \pm 70.2812, \qquad (-96.3165,\ 44.2459),$$
$$\beta_2\colon\quad .2930 \pm .5551, \qquad (-.2621,\ .8481),$$
$$\beta_3\colon\quad 1.0338 \pm 1.2777, \qquad (-.2439,\ 2.3115).$$

The confidence coefficient .95 holds for only one of the four confidence intervals. For
more than one interval, see Example 8.6.7.
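
The arithmetic in Example 8.6.2 can be sketched in a few lines of Python. This is only an illustration of (8.47): the inputs are the rounded values printed in the example, the critical value $t_{.025,15} = 2.1314$ is typed in rather than computed, and the helper name `t_interval` is hypothetical. Because the inputs are rounded, the results may differ from the printed intervals in the third or fourth decimal place.

```python
import math

# Rounded values quoted in Example 8.6.2 (y2, chemical reaction data).
beta_hat = [-26.0353, 0.4046, 0.2930, 1.0338]
g_diag = [65.37550, 0.00184, 0.00408, 0.02161]  # diagonal of (X'X)^{-1}
s = 4.0781       # square root of SSE/(n - k - 1), as used in the example
t_crit = 2.1314  # t_{.025,15}, copied from the example (not computed here)


def t_interval(estimate, g_jj, s, t_crit):
    """Return (lower, upper) for estimate +/- t_crit * s * sqrt(g_jj), as in (8.47)."""
    half_width = t_crit * s * math.sqrt(g_jj)
    return estimate - half_width, estimate + half_width


intervals = [t_interval(b, g, s, t_crit) for b, g in zip(beta_hat, g_diag)]
for j, (lo, hi) in enumerate(intervals):
    print(f"beta_{j}: ({lo:.4f}, {hi:.4f})")
```

Each interval is computed separately, so the .95 coefficient applies to one interval at a time, exactly as the example notes.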


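8.6.3 Confidence Interval for $\mathbf{a}'\boldsymbol{\beta}$


If $\mathbf{a}'\boldsymbol{\beta} \ne 0$, we can subtract $\mathbf{a}'\boldsymbol{\beta}$ from $\mathbf{a}'\hat{\boldsymbol{\beta}}$ in (8.44) to obtain

$$F = \frac{(\mathbf{a}'\hat{\boldsymbol{\beta}} - \mathbf{a}'\boldsymbol{\beta})^2}{s^2\,\mathbf{a}'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{a}},$$

which is distributed as $F(1, n - k - 1)$. Then by Problem 5.16,

$$t = \frac{\mathbf{a}'\hat{\boldsymbol{\beta}} - \mathbf{a}'\boldsymbol{\beta}}{s\sqrt{\mathbf{a}'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{a}}} \tag{8.48}$$

is distributed as $t(n - k - 1)$, and a $100(1 - \alpha)\%$ confidence interval for a single
value of $\mathbf{a}'\boldsymbol{\beta}$ is given by

$$\mathbf{a}'\hat{\boldsymbol{\beta}} \pm t_{\alpha/2,\,n-k-1}\,s\sqrt{\mathbf{a}'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{a}}. \tag{8.49}$$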


8.6.4 Confidence Interval for E(y) 


Let $\mathbf{x}_0 = (1, x_{01}, x_{02}, \ldots, x_{0k})'$ denote a particular choice of $\mathbf{x} = (1, x_1, x_2, \ldots, x_k)'$.
Note that $\mathbf{x}_0$ need not be one of the $\mathbf{x}$'s in the sample; that is, $\mathbf{x}_0'$ need not be a row
of $\mathbf{X}$. If $\mathbf{x}_0$ is very far outside the area covered by the sample, however, the prediction
may be poor. Let $y_0$ be an observation corresponding to $\mathbf{x}_0$. Then

$$y_0 = \mathbf{x}_0'\boldsymbol{\beta} + \varepsilon,$$




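and [assuming that the model is correct so that $E(\varepsilon) = 0$]

$$E(y_0) = \mathbf{x}_0'\boldsymbol{\beta}. \tag{8.50}$$

We wish to find a confidence interval for $E(y_0)$, that is, for the mean of the
distribution of $y$-values corresponding to $\mathbf{x}_0$.

By Corollary 1 to Theorem 7.6d, the minimum variance unbiased estimator of
$E(y_0)$ is given by

$$\widehat{E(y_0)} = \mathbf{x}_0'\hat{\boldsymbol{\beta}}. \tag{8.51}$$

Since (8.50) and (8.51) are of the form $\mathbf{a}'\boldsymbol{\beta}$ and $\mathbf{a}'\hat{\boldsymbol{\beta}}$, respectively, we obtain a
$100(1 - \alpha)\%$ confidence interval for $E(y_0) = \mathbf{x}_0'\boldsymbol{\beta}$ from (8.49):

$$\mathbf{x}_0'\hat{\boldsymbol{\beta}} \pm t_{\alpha/2,\,n-k-1}\,s\sqrt{\mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0}. \tag{8.52}$$

The confidence coefficient $1 - \alpha$ for the interval in (8.52) holds only for a single
choice of the vector $\mathbf{x}_0$. For intervals covering several values of $\mathbf{x}_0$ or all possible
values of $\mathbf{x}_0$, see Section 8.6.7.

We can express the confidence interval in (8.52) in terms of the centered model in
Section 7.5, $y_i = \alpha + \boldsymbol{\beta}_1'(\mathbf{x}_{01} - \bar{\mathbf{x}}_1) + \varepsilon_i$, where $\mathbf{x}_{01} = (x_{01}, x_{02}, \ldots, x_{0k})'$ and $\bar{\mathbf{x}}_1 = (\bar x_1, \bar x_2, \ldots, \bar x_k)'$.
[We use the notation $\mathbf{x}_{01}$ to distinguish this vector from $\mathbf{x}_0 = (1, x_{01}, x_{02}, \ldots, x_{0k})'$ above.]
For the centered model, (8.50), (8.51), and (8.52) become

$$E(y_0) = \alpha + \boldsymbol{\beta}_1'(\mathbf{x}_{01} - \bar{\mathbf{x}}_1), \tag{8.53}$$
$$\widehat{E(y_0)} = \bar y + \hat{\boldsymbol{\beta}}_1'(\mathbf{x}_{01} - \bar{\mathbf{x}}_1), \tag{8.54}$$
$$\bar y + \hat{\boldsymbol{\beta}}_1'(\mathbf{x}_{01} - \bar{\mathbf{x}}_1) \pm t_{\alpha/2,\,n-k-1}\,s\sqrt{\frac{1}{n} + (\mathbf{x}_{01} - \bar{\mathbf{x}}_1)'(\mathbf{X}_c'\mathbf{X}_c)^{-1}(\mathbf{x}_{01} - \bar{\mathbf{x}}_1)}. \tag{8.55}$$

Note that in the form shown in (8.55), it is clear that if $\mathbf{x}_{01}$ is close to $\bar{\mathbf{x}}_1$ the interval is
narrower; in fact, it is narrowest for $\mathbf{x}_{01} = \bar{\mathbf{x}}_1$. The width of the interval increases as the
distance of $\mathbf{x}_{01}$ from $\bar{\mathbf{x}}_1$ increases.

For the special case of simple linear regression, (8.50), (8.51), and (8.55) reduce to

$$E(y_0) = \beta_0 + \beta_1 x_0, \tag{8.56}$$
$$\widehat{E(y_0)} = \hat\beta_0 + \hat\beta_1 x_0, \tag{8.57}$$
$$\hat\beta_0 + \hat\beta_1 x_0 \pm t_{\alpha/2,\,n-2}\,s\sqrt{\frac{1}{n} + \frac{(x_0 - \bar x)^2}{\sum_{i=1}^n (x_i - \bar x)^2}}, \tag{8.58}$$

where $s$ is given by (6.11). The width of the interval in (8.58) depends on how far $x_0$ is
from $\bar x$.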




Example 8.6.4. For the grades data in Example 6.2, we find a 95% confidence
interval for $E(y_0)$, where $x_0 = 80$. Using (8.58), we obtain

$$\hat\beta_0 + \hat\beta_1(80) \pm t_{.025,16}\,s\sqrt{\frac{1}{18} + \frac{(80 - 58.056)^2}{19530.944}},$$
$$80.5386 \pm 2.1199(13.8547)(.2832),$$
$$80.5386 \pm 8.3183,$$
$$(72.2204,\ 88.8569).$$
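
As a check on the arithmetic, the half-width in (8.58) can be recomputed from the quantities printed in Example 8.6.4. The following is a minimal sketch, not a general routine: the function name `mean_response_halfwidth` is hypothetical, and the critical value $t_{.025,16} = 2.1199$ and the fitted value 80.5386 are copied from the example rather than derived from the raw data.

```python
import math


def mean_response_halfwidth(x0, x_bar, sxx, n, s, t_crit):
    """Half-width of the interval (8.58) for E(y0) in simple linear regression."""
    return t_crit * s * math.sqrt(1.0 / n + (x0 - x_bar) ** 2 / sxx)


# Quantities quoted in Example 8.6.4 (grades data, n = 18).
y0_hat = 80.5386  # fitted value at x0 = 80
half = mean_response_halfwidth(x0=80, x_bar=58.056, sxx=19530.944,
                               n=18, s=13.8547, t_crit=2.1199)
ci = (y0_hat - half, y0_hat + half)
print(f"half-width = {half:.4f}, interval = ({ci[0]:.4f}, {ci[1]:.4f})")
```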


8.6.5 Prediction Interval for a Future Observation 


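A "confidence interval" for a future observation $y_0$ corresponding to $\mathbf{x}_0$ is called a
prediction interval. We speak of a prediction interval rather than a confidence
interval because $y_0$ is an individual observation and is thereby a random variable
rather than a parameter. To be $100(1 - \alpha)\%$ confident that the interval contains $y_0$,
the prediction interval will clearly have to be wider than a confidence interval for
the parameter $E(y_0)$.

Since $y_0 = \mathbf{x}_0'\boldsymbol{\beta} + \varepsilon_0$, we predict $y_0$ by $\hat y_0 = \mathbf{x}_0'\hat{\boldsymbol{\beta}}$, which is also the estimator of
$E(y_0) = \mathbf{x}_0'\boldsymbol{\beta}$. The random variables $y_0$ and $\hat y_0$ are independent because $y_0$ is a
future observation to be obtained independently of the $n$ observations used to
compute $\hat y_0 = \mathbf{x}_0'\hat{\boldsymbol{\beta}}$. Hence the variance of $y_0 - \hat y_0$ is

$$\operatorname{var}(y_0 - \hat y_0) = \operatorname{var}(y_0 - \mathbf{x}_0'\hat{\boldsymbol{\beta}}) = \operatorname{var}(\mathbf{x}_0'\boldsymbol{\beta} + \varepsilon_0 - \mathbf{x}_0'\hat{\boldsymbol{\beta}}).$$

Since $\mathbf{x}_0'\boldsymbol{\beta}$ is a constant, this becomes

$$\operatorname{var}(y_0 - \hat y_0) = \operatorname{var}(\varepsilon_0) + \operatorname{var}(\mathbf{x}_0'\hat{\boldsymbol{\beta}}) = \sigma^2 + \sigma^2\mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0 = \sigma^2\left[1 + \mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0\right], \tag{8.59}$$

which is estimated by $s^2[1 + \mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0]$. It can be shown that $E(y_0 - \hat y_0) = 0$ and
that $s^2$ is independent of both $y_0$ and $\hat y_0 = \mathbf{x}_0'\hat{\boldsymbol{\beta}}$. Therefore, the $t$ statistic

$$t = \frac{y_0 - \hat y_0 - 0}{s\sqrt{1 + \mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0}} \tag{8.60}$$

is distributed as $t(n - k - 1)$, and

$$P\left[-t_{\alpha/2,\,n-k-1} \le \frac{y_0 - \hat y_0}{s\sqrt{1 + \mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0}} \le t_{\alpha/2,\,n-k-1}\right] = 1 - \alpha.$$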




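The inequality can be solved for $y_0$ to obtain the $100(1 - \alpha)\%$ prediction interval

$$\hat y_0 - t_{\alpha/2,\,n-k-1}\,s\sqrt{1 + \mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0} \le y_0 \le \hat y_0 + t_{\alpha/2,\,n-k-1}\,s\sqrt{1 + \mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0}$$

or, using $\hat y_0 = \mathbf{x}_0'\hat{\boldsymbol{\beta}}$, we have

$$\mathbf{x}_0'\hat{\boldsymbol{\beta}} \pm t_{\alpha/2,\,n-k-1}\,s\sqrt{1 + \mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0}. \tag{8.61}$$

Note that the confidence coefficient $1 - \alpha$ for the prediction interval in (8.61) holds
for only one value of $\mathbf{x}_0$.

In $1 + \mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0$, the second term, $\mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0$, is typically much smaller than
1 (provided $k$ is much smaller than $n$) because the variance of $\hat y_0 = \mathbf{x}_0'\hat{\boldsymbol{\beta}}$ is much less
than the variance of $y_0$. [To illustrate, if $\mathbf{X}'\mathbf{X}$ were diagonal and $\mathbf{x}_0$ were in the area
covered by the rows of $\mathbf{X}$, then $\mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0$ would be a sum with $k + 1$ terms,
each of the form $x_{0j}^2/\sum_{i=1}^n x_{ij}^2$, which is of the order of $1/n$.] Thus prediction intervals
for $y_0$ are generally much wider than confidence intervals for $E(y_0) = \mathbf{x}_0'\boldsymbol{\beta}$.

In terms of the centered model in Section 7.5, the $100(1 - \alpha)\%$ prediction interval
in (8.61) becomes

$$\bar y + \hat{\boldsymbol{\beta}}_1'(\mathbf{x}_{01} - \bar{\mathbf{x}}_1) \pm t_{\alpha/2,\,n-k-1}\,s\sqrt{1 + \frac{1}{n} + (\mathbf{x}_{01} - \bar{\mathbf{x}}_1)'(\mathbf{X}_c'\mathbf{X}_c)^{-1}(\mathbf{x}_{01} - \bar{\mathbf{x}}_1)}. \tag{8.62}$$

For the case of simple linear regression, (8.61) and (8.62) reduce to

$$\hat\beta_0 + \hat\beta_1 x_0 \pm t_{\alpha/2,\,n-2}\,s\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar x)^2}{\sum_{i=1}^n (x_i - \bar x)^2}}, \tag{8.63}$$

where $s$ is given by (6.11). In (8.63), it is clear that the second and third terms within
the square root are much smaller than 1 unless $x_0$ is far removed from the interval
bounded by the smallest and largest $x$'s.

For a prediction interval for the mean of $q$ future observations, see Problem 8.30.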


Example 8.6.5. Using the data from Example 6.2, we find a 95% prediction interval
for $y_0$ when $x_0 = 80$. Using (8.63), we obtain

$$\hat\beta_0 + \hat\beta_1(80) \pm t_{.025,16}\,s\sqrt{1 + \frac{1}{18} + \frac{(80 - 58.056)^2}{19530.944}},$$
$$80.5386 \pm 2.1199(13.8547)(1.0393),$$
$$80.5386 \pm 30.5258,$$
$$(50.0128,\ 111.0644).$$




Note that the prediction interval for $y_0$ here is much wider than the confidence interval
for $E(y_0)$ in Example 8.6.4.
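
To see how the two kinds of interval behave as $x_0$ moves, the half-widths in (8.58) and (8.63) can be tabulated side by side. The sketch below reuses the summary quantities printed in Examples 8.6.4 and 8.6.5; the grid of $x_0$ values other than 80 and the function name `halfwidths` are illustrative assumptions, not taken from the text.

```python
import math

# Summary quantities quoted in Examples 8.6.4 and 8.6.5 (grades data).
X_BAR, SXX, N, S, T_CRIT = 58.056, 19530.944, 18, 13.8547, 2.1199


def halfwidths(x0):
    """Return (ci_half, pi_half) from (8.58) and (8.63) for the grades data."""
    extra = 1.0 / N + (x0 - X_BAR) ** 2 / SXX  # second and third terms in (8.63)
    return T_CRIT * S * math.sqrt(extra), T_CRIT * S * math.sqrt(1.0 + extra)


# x0 = 80 is the value used in the examples; the others are arbitrary.
table = {x0: halfwidths(x0) for x0 in (40, X_BAR, 80, 100)}
for x0, (ci_half, pi_half) in table.items():
    print(f"x0 = {x0:7.3f}  CI +/- {ci_half:7.4f}  PI +/- {pi_half:7.4f}")
```

The confidence half-width changes noticeably with $x_0$ and is smallest at $x_0 = \bar x$, while the prediction half-width is dominated by the leading 1 under the square root and varies far less.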


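8.6.6 Confidence Interval for $\sigma^2$


By Theorem 7.6b(ii), $(n - k - 1)s^2/\sigma^2$ is $\chi^2(n - k - 1)$. Therefore

$$P\left[\chi^2_{1-\alpha/2,\,n-k-1} \le \frac{(n - k - 1)s^2}{\sigma^2} \le \chi^2_{\alpha/2,\,n-k-1}\right] = 1 - \alpha, \tag{8.64}$$

where $\chi^2_{\alpha/2,\,n-k-1}$ is the upper $\alpha/2$ percentage point of the chi-square distribution and
$\chi^2_{1-\alpha/2,\,n-k-1}$ is the lower $\alpha/2$ percentage point. Solving the inequality for $\sigma^2$ yields
the $100(1 - \alpha)\%$ confidence interval

$$\frac{(n - k - 1)s^2}{\chi^2_{\alpha/2,\,n-k-1}} \le \sigma^2 \le \frac{(n - k - 1)s^2}{\chi^2_{1-\alpha/2,\,n-k-1}}. \tag{8.65}$$

A $100(1 - \alpha)\%$ confidence interval for $\sigma$ is given by

$$\sqrt{\frac{(n - k - 1)s^2}{\chi^2_{\alpha/2,\,n-k-1}}} \le \sigma \le \sqrt{\frac{(n - k - 1)s^2}{\chi^2_{1-\alpha/2,\,n-k-1}}}. \tag{8.66}$$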
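
The intervals (8.65) and (8.66) are simple to evaluate once the two chi-square percentage points are in hand. The following sketch is only an illustration: it borrows $s = 4.0781$ with $n - k - 1 = 15$ from Example 8.6.2, and it assumes the percentage points 27.488 and 6.262 read from a standard chi-square table for $\alpha = .05$ and 15 degrees of freedom (these table values, the function name `variance_interval`, and the resulting interval do not appear in the text).

```python
import math


def variance_interval(s2, df, chi2_upper, chi2_lower):
    """Interval (8.65) for sigma^2; chi2_upper and chi2_lower are the upper
    and lower alpha/2 percentage points of chi-square(df)."""
    return df * s2 / chi2_upper, df * s2 / chi2_lower


# s and df come from Example 8.6.2; the chi-square points are assumed
# table values for alpha = .05 and df = 15, not quantities from the text.
s2 = 4.0781 ** 2
var_lo, var_hi = variance_interval(s2, df=15, chi2_upper=27.488, chi2_lower=6.262)
sd_lo, sd_hi = math.sqrt(var_lo), math.sqrt(var_hi)  # interval (8.66) for sigma
print(f"sigma^2: ({var_lo:.3f}, {var_hi:.3f}); sigma: ({sd_lo:.3f}, {sd_hi:.3f})")
```

Note that the larger (upper) percentage point produces the lower limit, and that the interval is not symmetric about $s^2$.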

8.6.7 Simultaneous Intervals


By analogy to the discussion of testing several hypotheses (Section 8.5.2), when
several intervals are computed, two confidence coefficients can be considered:
familywise confidence ($1 - \alpha_f$) and individual confidence ($1 - \alpha_c$). Familywise
confidence of $1 - \alpha_f$ means that we are $100(1 - \alpha_f)\%$ confident that every interval contains
its respective parameter.

In some cases, our goal is simply to control $1 - \alpha_c$ for each one of several
confidence or prediction intervals so that no changes are needed to expressions (8.47),
(8.49), (8.52), and (8.61). In other cases the desire is to control $1 - \alpha_f$. To do so,
both the Bonferroni and Scheffé methods can be adapted to the situation of multiple
intervals. In yet other cases we may want to control other properties of multiple
intervals (Benjamini and Yekutieli 2005).

The Bonferroni procedure increases the width of each individual interval so that
$1 - \alpha_f$ for the set of intervals is greater than or equal to the desired value $1 - \alpha^*$.
As an example, suppose that it is desired to calculate the $k$ confidence intervals for
$\beta_1, \ldots, \beta_k$. Let $E_j$ be the event that the $j$th interval includes $\beta_j$, and $E_j^c$ be the
complement of that event. Then by definition

$$1 - \alpha_f = P(E_1 \cap E_2 \cap \cdots \cap E_k) = 1 - P(E_1^c \cup E_2^c \cup \cdots \cup E_k^c).$$


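Assuming that $P(E_j^c) = \alpha_c$ for $j = 1, \ldots, k$, the Bonferroni inequality now implies that

$$1 - \alpha_f \ge 1 - k\alpha_c.$$

Hence we can ensure that $1 - \alpha_f$ is greater than or equal to the desired $1 - \alpha^*$ by setting
$1 - \alpha_c = 1 - \alpha^*/k$ for the individual intervals.

Using this approach, Bonferroni confidence intervals for $\beta_1, \beta_2, \ldots, \beta_k$ are given by

$$\hat\beta_j \pm t_{\alpha^*/2k,\,n-k-1}\,s\sqrt{g_{jj}}, \qquad j = 1, 2, \ldots, k, \tag{8.67}$$

where $g_{jj}$ is the $j$th diagonal element of $(\mathbf{X}'\mathbf{X})^{-1}$. Bonferroni $t$ values $t_{\alpha^*/2k}$ are available in
Bailey (1977) and can also be obtained in many software programs. For example, a
probability calculator for the $t$, the $F$, and other distributions is available free from
NCSS (download at www.ncss.com).

Similarly, for $d$ linear functions $\mathbf{a}_1'\boldsymbol{\beta}, \mathbf{a}_2'\boldsymbol{\beta}, \ldots, \mathbf{a}_d'\boldsymbol{\beta}$ (chosen before seeing the data),
Bonferroni confidence intervals are given by

$$\mathbf{a}_i'\hat{\boldsymbol{\beta}} \pm t_{\alpha^*/2d,\,n-k-1}\,s\sqrt{\mathbf{a}_i'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{a}_i}, \qquad i = 1, 2, \ldots, d. \tag{8.68}$$

These intervals hold simultaneously with familywise confidence of at least $1 - \alpha^*$.

Bonferroni confidence intervals for $E(y_0) = \mathbf{x}_0'\boldsymbol{\beta}$ for a few values of $\mathbf{x}_0$, say,
$\mathbf{x}_{01}, \mathbf{x}_{02}, \ldots, \mathbf{x}_{0d}$, are given by

$$\mathbf{x}_{0i}'\hat{\boldsymbol{\beta}} \pm t_{\alpha^*/2d,\,n-k-1}\,s\sqrt{\mathbf{x}_{0i}'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_{0i}}, \qquad i = 1, 2, \ldots, d. \tag{8.69}$$

[Note that $\mathbf{x}_{01}$ here differs from $\mathbf{x}_{01}$ in (8.53)-(8.55).]

For simultaneous prediction of $d$ new observations $y_{01}, y_{02}, \ldots, y_{0d}$ at $d$ values of
$\mathbf{x}_0$, say, $\mathbf{x}_{01}, \mathbf{x}_{02}, \ldots, \mathbf{x}_{0d}$, we can use the Bonferroni prediction intervals

$$\mathbf{x}_{0i}'\hat{\boldsymbol{\beta}} \pm t_{\alpha^*/2d,\,n-k-1}\,s\sqrt{1 + \mathbf{x}_{0i}'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_{0i}}, \qquad i = 1, 2, \ldots, d \tag{8.70}$$

[see (8.61) and (8.69)].

Simultaneous Scheffé confidence intervals for all possible linear functions $\mathbf{a}'\boldsymbol{\beta}$
(including those chosen after seeing the data) can be based on the distribution of
$\max_{\mathbf{a}} F$ [Theorem 8.5(ii)]. Thus a conservative confidence interval for any and all $\mathbf{a}'\boldsymbol{\beta}$ is

$$\mathbf{a}'\hat{\boldsymbol{\beta}} \pm s\sqrt{(k + 1)F_{\alpha^*,\,k+1,\,n-k-1}\,\mathbf{a}'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{a}}. \tag{8.71}$$

The (potentially infinite number of) intervals in (8.71) have an overall confidence
coefficient of at least $1 - \alpha^*$. For a few linear functions, the intervals in (8.68) will
be narrower, but for a large number of linear functions, the intervals in (8.71) will
be narrower. A comparison of $t_{\alpha^*/2d,\,n-k-1}$ and $\sqrt{(k + 1)F_{\alpha^*,\,k+1,\,n-k-1}}$ will show
which is preferred in a given case.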




For confidence limits for $E(y_0) = \mathbf{x}_0'\boldsymbol{\beta}$ for all possible values of $\mathbf{x}_0$, we use (8.71):

$$\mathbf{x}_0'\hat{\boldsymbol{\beta}} \pm s\sqrt{(k + 1)F_{\alpha^*,\,k+1,\,n-k-1}\,\mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0}. \tag{8.72}$$

These intervals hold simultaneously with a confidence coefficient of $1 - \alpha^*$. Thus,
(8.72) becomes a confidence region that can be applied to the entire regression
surface for all values of $\mathbf{x}_0$. The intervals in (8.71) and (8.72) are due to Scheffé
(1953; 1959, p. 68) and Working and Hotelling (1929).

Scheffé-type prediction intervals for $y_{01}, y_{02}, \ldots, y_{0d}$ are given by

$$\mathbf{x}_{0i}'\hat{\boldsymbol{\beta}} \pm s\sqrt{dF_{\alpha^*,\,d,\,n-k-1}\left[1 + \mathbf{x}_{0i}'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_{0i}\right]}, \qquad i = 1, 2, \ldots, d \tag{8.73}$$

(see Problem 8.32). These $d$ prediction intervals hold simultaneously with overall
confidence coefficient at least $1 - \alpha^*$, but note that $dF_{\alpha^*,\,d,\,n-k-1}$ is not constant. It
depends on the number of predictions.


Example 8.6.7. We compute 95% Bonferroni confidence limits for $\beta_1$, $\beta_2$, and $\beta_3$,
using $y_2$ in the chemical reaction data in Table 7.4; see Example 8.6.2 for $(\mathbf{X}'\mathbf{X})^{-1}$
and $\hat{\boldsymbol{\beta}}$. By (8.67), we have

$$\hat\beta_1 \pm t_{.025/3,\,15}\,s\sqrt{g_{11}},$$
$$.4046 \pm (2.6937)(4.0781)\sqrt{.00184},$$
$$.4046 \pm .4706,$$
$$(-.0660,\ .8751),$$

$$\beta_2\colon\quad .2930 \pm .7016, \qquad (-.4086,\ .9946),$$
$$\beta_3\colon\quad 1.0338 \pm 1.6147, \qquad (-.5809,\ 2.6485).$$

These three intervals hold simultaneously with confidence coefficient at least .95.
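
The Bonferroni limits of Example 8.6.7 differ from the individual limits of Example 8.6.2 only through the critical value. A minimal sketch of (8.67), using the rounded quantities printed in the two examples (both critical values are copied from the text rather than computed, and the helper name `half_width` is hypothetical):

```python
import math

# Rounded values quoted in Examples 8.6.2 and 8.6.7 (y2, chemical reaction data).
beta_hat = {1: 0.4046, 2: 0.2930, 3: 1.0338}
g_diag = {1: 0.00184, 2: 0.00408, 3: 0.02161}  # diagonal entries of (X'X)^{-1}
s = 4.0781
t_individual = 2.1314  # t_{.025,15}, used in (8.47)
t_bonferroni = 2.6937  # t_{.025/3,15}, used in (8.67) with k = 3


def half_width(t_crit, g_jj):
    """Half-width t_crit * s * sqrt(g_jj) shared by (8.47) and (8.67)."""
    return t_crit * s * math.sqrt(g_jj)


bonferroni = {j: (beta_hat[j] - half_width(t_bonferroni, g_diag[j]),
                  beta_hat[j] + half_width(t_bonferroni, g_diag[j]))
              for j in beta_hat}
widening = t_bonferroni / t_individual  # common factor by which each interval grows
for j, (lo, hi) in bonferroni.items():
    print(f"beta_{j}: ({lo:.4f}, {hi:.4f})")
print(f"each interval is {widening:.3f} times as wide as its (8.47) counterpart")
```

Because the inputs are rounded, the computed limits can differ from the printed ones in the later decimal places. The widening factor is the same for every coefficient, since only the critical value changes.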


8.7 LIKELIHOOD RATIO TESTS


The tests in Sections 8.1, 8.2, and 8.4 were derived using informal methods based 
on finding sums of squares that have chi-square distributions and are independent. 
These same tests can be obtained more formally by the likelihood ratio approach. 
Likelihood ratio tests have some good properties and sometimes have optimal 
properties. 

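We describe the likelihood ratio method in the simple context of testing
$H_0\colon \boldsymbol{\beta} = \mathbf{0}$ versus $H_1\colon \boldsymbol{\beta} \ne \mathbf{0}$. The likelihood function $L(\boldsymbol{\beta}, \sigma^2)$ was defined in
Section 7.6.2 as the joint density of the $y$'s. For a random sample
$\mathbf{y} = (y_1, y_2, \ldots, y_n)'$ with density $N_n(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{I})$, the likelihood function is given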




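by (7.50) as

$$L(\boldsymbol{\beta}, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}}\,e^{-(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})/2\sigma^2}. \tag{8.74}$$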


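The likelihood ratio method compares the maximum value of $L(\boldsymbol{\beta}, \sigma^2)$ restricted
by $H_0\colon \boldsymbol{\beta} = \mathbf{0}$ to the maximum value of $L(\boldsymbol{\beta}, \sigma^2)$ under $H_1\colon \boldsymbol{\beta} \ne \mathbf{0}$, which is
essentially unrestricted. We denote the maximum value of $L(\boldsymbol{\beta}, \sigma^2)$ restricted by $\boldsymbol{\beta} = \mathbf{0}$
as $\max_{H_0} L(\boldsymbol{\beta}, \sigma^2)$ and the unrestricted maximum as $\max_{H_1} L(\boldsymbol{\beta}, \sigma^2)$. If $\boldsymbol{\beta}$ is equal
(or close) to $\mathbf{0}$, then $\max_{H_0} L(\boldsymbol{\beta}, \sigma^2)$ should be close to $\max_{H_1} L(\boldsymbol{\beta}, \sigma^2)$. If
$\max_{H_0} L(\boldsymbol{\beta}, \sigma^2)$ is not close to $\max_{H_1} L(\boldsymbol{\beta}, \sigma^2)$, we would conclude that
$\mathbf{y} = (y_1, y_2, \ldots, y_n)'$ apparently did not come from $N_n(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{I})$ with $\boldsymbol{\beta} = \mathbf{0}$.

In this illustration, we can find $\max_{H_0} L(\boldsymbol{\beta}, \sigma^2)$ by setting $\boldsymbol{\beta} = \mathbf{0}$ and then estimating
$\sigma^2$ as the value that maximizes $L(\mathbf{0}, \sigma^2)$. Under $H_1\colon \boldsymbol{\beta} \ne \mathbf{0}$, both $\boldsymbol{\beta}$ and $\sigma^2$ are
estimated without restriction as the values that maximize $L(\boldsymbol{\beta}, \sigma^2)$. [In designating the
unrestricted maximum as $\max_{H_1} L(\boldsymbol{\beta}, \sigma^2)$, we are ignoring the restriction in $H_1$ that
$\boldsymbol{\beta} \ne \mathbf{0}$.]

It is customary to describe the likelihood ratio method in terms of maximizing $L$
subject to $\omega$, the set of all values of $\boldsymbol{\beta}$ and $\sigma^2$ satisfying $H_0$, and subject to $\Omega$, the set of
all values of $\boldsymbol{\beta}$ and $\sigma^2$ without restrictions (other than natural restrictions such as
$\sigma^2 > 0$). However, to simplify notation in cases such as this in which $H_1$ includes
all values of $\boldsymbol{\beta}$ except $\mathbf{0}$, we refer to maximizing $L$ under $H_0$ and $H_1$.

We compare the restricted maximum under $H_0$ with the unrestricted maximum
under $H_1$ by the likelihood ratio

$$LR = \frac{\max_{H_0} L(\boldsymbol{\beta}, \sigma^2)}{\max_{H_1} L(\boldsymbol{\beta}, \sigma^2)} = \frac{\max L(\mathbf{0}, \sigma^2)}{\max L(\boldsymbol{\beta}, \sigma^2)}. \tag{8.75}$$

It is clear that $0 \le LR \le 1$, because the maximum of $L$ restricted to $\boldsymbol{\beta} = \mathbf{0}$ cannot
exceed the unrestricted maximum. Smaller values of $LR$ would favor $H_1$, and
larger values would favor $H_0$. We thus reject $H_0$ if $LR \le c$, where $c$ is chosen so
that $P(LR \le c) = \alpha$ if $H_0$ is true.

Wald (1943) showed that, under $H_0$,

$$-2\ln LR \text{ is approximately } \chi^2(\nu)$$

for large $n$, where $\nu$ is the number of parameters estimated under $H_1$ minus the
number estimated under $H_0$. In the case of $H_0\colon \boldsymbol{\beta} = \mathbf{0}$ versus $H_1\colon \boldsymbol{\beta} \ne \mathbf{0}$, we have
$\nu = k + 2 - 1 = k + 1$ because $\boldsymbol{\beta}$ and $\sigma^2$ are estimated under $H_1$ while only $\sigma^2$ is
estimated under $H_0$. In some cases, the $\chi^2$ approximation is not needed because
$LR$ turns out to be a function of a familiar test statistic, such as $t$ or $F$, whose exact
distribution is available.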




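We now obtain the likelihood ratio test for $H_0\colon \boldsymbol{\beta} = \mathbf{0}$. The resulting likelihood
ratio is a function of the $F$ statistic obtained in Problem 8.6 by partitioning the
total sum of squares.


Theorem 8.7a. If $\mathbf{y}$ is $N_n(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{I})$, the likelihood ratio test for $H_0\colon \boldsymbol{\beta} = \mathbf{0}$ can be
based on

$$F = \frac{\hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y}/(k + 1)}{(\mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y})/(n - k - 1)}.$$

We reject $H_0$ if $F > F_{\alpha,\,k+1,\,n-k-1}$.


PROOF. To find $\max_{H_1} L(\boldsymbol{\beta}, \sigma^2) = \max L(\boldsymbol{\beta}, \sigma^2)$, we use the maximum likelihood
estimators $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$ and $\hat\sigma^2 = (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})/n$ from Theorem 7.6a.
Substituting these in (8.74), we obtain

$$\max_{H_1} L(\boldsymbol{\beta}, \sigma^2) = \max L(\boldsymbol{\beta}, \sigma^2) = L(\hat{\boldsymbol{\beta}}, \hat\sigma^2)
= \frac{1}{(2\pi\hat\sigma^2)^{n/2}}\,e^{-(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})/2\hat\sigma^2}
= \frac{n^{n/2}e^{-n/2}}{(2\pi)^{n/2}\left[(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})\right]^{n/2}}. \tag{8.76}$$

To find $\max_{H_0} L(\boldsymbol{\beta}, \sigma^2) = \max L(\mathbf{0}, \sigma^2)$, we solve $\partial \ln L(\mathbf{0}, \sigma^2)/\partial\sigma^2 = 0$ to obtain

$$\hat\sigma_0^2 = \frac{\mathbf{y}'\mathbf{y}}{n}. \tag{8.77}$$

Then

$$\max_{H_0} L(\boldsymbol{\beta}, \sigma^2) = \max L(\mathbf{0}, \sigma^2) = L(\mathbf{0}, \hat\sigma_0^2)
= \frac{1}{(2\pi\hat\sigma_0^2)^{n/2}}\,e^{-\mathbf{y}'\mathbf{y}/2\hat\sigma_0^2}
= \frac{n^{n/2}e^{-n/2}}{(2\pi)^{n/2}(\mathbf{y}'\mathbf{y})^{n/2}}. \tag{8.78}$$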


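Substituting (8.76) and (8.78) into (8.75), we obtain

$$LR = \frac{\max_{H_0} L(\boldsymbol{\beta}, \sigma^2)}{\max_{H_1} L(\boldsymbol{\beta}, \sigma^2)}
= \left[\frac{(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})}{\mathbf{y}'\mathbf{y}}\right]^{n/2}
= \left[\frac{1}{1 + (k + 1)F/(n - k - 1)}\right]^{n/2}, \tag{8.79}$$

where

$$F = \frac{\hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y}/(k + 1)}{(\mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y})/(n - k - 1)}.$$

Thus, rejecting $H_0\colon \boldsymbol{\beta} = \mathbf{0}$ for a small value of $LR$ is equivalent to rejecting $H_0$ for a
large value of $F$.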
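
The equality of the two forms of $LR$ in (8.79) can be checked numerically. The sketch below uses a small made-up data set for a model with one predictor ($k = 1$); the data values are hypothetical and serve only to illustrate that the ratio $(\mathrm{SSE}/\mathbf{y}'\mathbf{y})^{n/2}$ and the expression in terms of $F$ agree. It relies on the identity $\hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} = \mathbf{y}'\mathbf{y} - \mathrm{SSE}$.

```python
# Hypothetical data for y = b0 + b1*x + error (k = 1); not from the text.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3]
n, k = len(y), 1

# Least-squares fit via the simple-regression formulas.
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

yty = sum(yi ** 2 for yi in y)                               # y'y
sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))  # (y - Xb)'(y - Xb)

# F statistic of Theorem 8.7a, using b'X'y = y'y - SSE.
f_stat = ((yty - sse) / (k + 1)) / (sse / (n - k - 1))

lr_direct = (sse / yty) ** (n / 2)  # first form in (8.79)
lr_from_f = (1.0 / (1.0 + (k + 1) * f_stat / (n - k - 1))) ** (n / 2)
print(lr_direct, lr_from_f)
```

A large $F$ corresponds to an $LR$ close to zero, which is why rejecting for small $LR$ and rejecting for large $F$ are the same rule.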


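We now show that the $F$ test in Theorem 8.4b for the general linear hypothesis
$H_0\colon \mathbf{C}\boldsymbol{\beta} = \mathbf{0}$ is a likelihood ratio test.


Theorem 8.7b. If $\mathbf{y}$ is $N_n(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{I})$, then the $F$ test for $H_0\colon \mathbf{C}\boldsymbol{\beta} = \mathbf{0}$ in Theorem 8.4b
is equivalent to the likelihood ratio test.


PROOF. Under $H_1\colon \mathbf{C}\boldsymbol{\beta} \ne \mathbf{0}$, which is essentially unrestricted, $\max_{H_1} L(\boldsymbol{\beta}, \sigma^2)$ is
given by (8.76). To find $\max_{H_0} L(\boldsymbol{\beta}, \sigma^2) = \max L(\boldsymbol{\beta}, \sigma^2)$ subject to $\mathbf{C}\boldsymbol{\beta} = \mathbf{0}$, we
use the method of Lagrange multipliers (Section 2.14.3) and work with $\ln L(\boldsymbol{\beta}, \sigma^2)$
to simplify the differentiation:

$$v = \ln L(\boldsymbol{\beta}, \sigma^2) + \boldsymbol{\lambda}'(\mathbf{C}\boldsymbol{\beta} - \mathbf{0})
= -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^2 - \frac{(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})}{2\sigma^2} + \boldsymbol{\lambda}'\mathbf{C}\boldsymbol{\beta}.$$

Expanding $(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$ and differentiating with respect to $\boldsymbol{\beta}$, $\boldsymbol{\lambda}$, and $\sigma^2$, we
obtain

$$\frac{\partial v}{\partial\boldsymbol{\beta}} = (2\mathbf{X}'\mathbf{y} - 2\mathbf{X}'\mathbf{X}\boldsymbol{\beta})/2\sigma^2 + \mathbf{C}'\boldsymbol{\lambda} = \mathbf{0}, \tag{8.80}$$
$$\frac{\partial v}{\partial\boldsymbol{\lambda}} = \mathbf{C}\boldsymbol{\beta} = \mathbf{0}, \tag{8.81}$$
$$\frac{\partial v}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) = 0. \tag{8.82}$$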




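Eliminating $\boldsymbol{\lambda}$ and solving for $\boldsymbol{\beta}$ and $\sigma^2$ gives

$$\hat{\boldsymbol{\beta}}_0 = \hat{\boldsymbol{\beta}} - (\mathbf{X}'\mathbf{X})^{-1}\mathbf{C}'[\mathbf{C}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{C}']^{-1}\mathbf{C}\hat{\boldsymbol{\beta}}, \tag{8.83}$$
$$\hat\sigma_0^2 = \frac{1}{n}(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}_0)'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}_0) \tag{8.84}$$
$$\phantom{\hat\sigma_0^2} = \hat\sigma^2 + \frac{1}{n}(\mathbf{C}\hat{\boldsymbol{\beta}})'[\mathbf{C}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{C}']^{-1}\mathbf{C}\hat{\boldsymbol{\beta}} \tag{8.85}$$

(Problems 8.35 and 8.36), where $\hat\sigma^2 = (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})/n$ and $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$
are the maximum likelihood estimates from Theorem 7.6a. Thus

$$\max_{H_0} L(\boldsymbol{\beta}, \sigma^2) = L(\hat{\boldsymbol{\beta}}_0, \hat\sigma_0^2)
= \frac{1}{(2\pi)^{n/2}(\hat\sigma_0^2)^{n/2}}\,e^{-(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}_0)'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}_0)/2\hat\sigma_0^2}
= \frac{n^{n/2}e^{-n/2}}{(2\pi)^{n/2}\left\{\mathrm{SSE} + (\mathbf{C}\hat{\boldsymbol{\beta}})'[\mathbf{C}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{C}']^{-1}\mathbf{C}\hat{\boldsymbol{\beta}}\right\}^{n/2}},$$

and

$$LR = \frac{\max_{H_0} L(\boldsymbol{\beta}, \sigma^2)}{\max_{H_1} L(\boldsymbol{\beta}, \sigma^2)}
= \left[\frac{\mathrm{SSE}}{\mathrm{SSE} + (\mathbf{C}\hat{\boldsymbol{\beta}})'[\mathbf{C}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{C}']^{-1}\mathbf{C}\hat{\boldsymbol{\beta}}}\right]^{n/2}
= \left[\frac{1}{1 + \mathrm{SSH}/\mathrm{SSE}}\right]^{n/2}
= \left[\frac{1}{1 + qF/(n - k - 1)}\right]^{n/2},$$

where $\mathrm{SSH} = (\mathbf{C}\hat{\boldsymbol{\beta}})'[\mathbf{C}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{C}']^{-1}\mathbf{C}\hat{\boldsymbol{\beta}}$, $\mathrm{SSE} = (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})$, and $F$ is given
in (8.27).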


PROBLEMS 


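8.1 Show that $\mathrm{SSR} = \hat{\boldsymbol{\beta}}_1'\mathbf{X}_c'\mathbf{X}_c\hat{\boldsymbol{\beta}}_1$ in (8.1) becomes $\mathbf{y}'\mathbf{X}_c(\mathbf{X}_c'\mathbf{X}_c)^{-1}\mathbf{X}_c'\mathbf{y}$ as in (8.2).


8.2 (a) Show that $\mathbf{H}_c[\mathbf{I} - (1/n)\mathbf{J}] = \mathbf{H}_c$, as in (8.3) in Theorem 8.1a(i), where
$\mathbf{H}_c = \mathbf{X}_c(\mathbf{X}_c'\mathbf{X}_c)^{-1}\mathbf{X}_c'$.
(b) Prove Theorem 8.1a(ii).
(c) Prove Theorem 8.1a(iii).
(d) Prove Theorem 8.1a(iv).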


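8.3 Show that $\lambda_1 = \boldsymbol{\beta}_1'\mathbf{X}_c'\mathbf{X}_c\boldsymbol{\beta}_1/2\sigma^2$ as in Theorem 8.1b(i).

8.4 Prove Theorem 8.1b(ii).

8.5 Show that $E(\mathrm{SSR}/k) = \sigma^2 + (1/k)\boldsymbol{\beta}_1'\mathbf{X}_c'\mathbf{X}_c\boldsymbol{\beta}_1$, as in the expected mean square
column of Table 8.1. Employ the following two approaches:
(a) Use Theorem 5.2a.
(b) Use the noncentrality parameter in (5.19).

8.6 Develop a test for $H_0\colon \boldsymbol{\beta} = \mathbf{0}$ in the model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$, where $\mathbf{y}$ is
$N_n(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{I})$. (It was noted at the beginning of Section 8.1 that this hypothesis
is of little practical interest because it includes $\beta_0 = 0$.) Use the partitioning
$\mathbf{y}'\mathbf{y} = (\mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y}) + \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y}$, and proceed as follows:
(a) Show that $\hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} = \mathbf{y}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$ and $\mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} = \mathbf{y}'[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']\mathbf{y}$.
(b) Let $\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$. Show that $\mathbf{H}$ and $\mathbf{I} - \mathbf{H}$ are idempotent of rank
$k + 1$ and $n - k - 1$, respectively.
(c) Show that $\mathbf{y}'\mathbf{H}\mathbf{y}/\sigma^2$ is $\chi^2(k + 1, \lambda_1)$, where $\lambda_1 = \boldsymbol{\beta}'\mathbf{X}'\mathbf{X}\boldsymbol{\beta}/2\sigma^2$, and that
$\mathbf{y}'(\mathbf{I} - \mathbf{H})\mathbf{y}/\sigma^2$ is $\chi^2(n - k - 1)$.
(d) Show that $\mathbf{y}'\mathbf{H}\mathbf{y}$ and $\mathbf{y}'(\mathbf{I} - \mathbf{H})\mathbf{y}$ are independent.
(e) Show that

$$\frac{\hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y}}{(k + 1)s^2} = \frac{\mathbf{y}'\mathbf{H}\mathbf{y}/(k + 1)}{\mathbf{y}'(\mathbf{I} - \mathbf{H})\mathbf{y}/(n - k - 1)}$$

is distributed as $F(k + 1, n - k - 1, \lambda_1)$.

8.7 Show that $\mathbf{H}\mathbf{H}_1 = \mathbf{H}_1$ and $\mathbf{H}_1\mathbf{H} = \mathbf{H}_1$, as in (8.15), where $\mathbf{H}$ and $\mathbf{H}_1$ are as
defined in (8.11) and (8.12).

8.8 Show that conditions (a) and (b) of Corollary 1 to Theorem 5.6c are satisfied
for the sum of quadratic forms in (8.12), as noted in the proof of Theorem 8.2b.

8.9 Show that $\lambda_1 = \boldsymbol{\beta}_2'[\mathbf{X}_2'\mathbf{X}_2 - \mathbf{X}_2'\mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{X}_2]\boldsymbol{\beta}_2/2\sigma^2$ as in Theorem 8.2b(ii).

8.10 Show that $\mathbf{X}_2'\mathbf{X}_2 - \mathbf{X}_2'\mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{X}_2$ is positive definite, as noted below
Theorem 8.2b.

8.11 Show that $E[\mathrm{SS}(\boldsymbol{\beta}_2|\boldsymbol{\beta}_1)/h] = \sigma^2 + \boldsymbol{\beta}_2'[\mathbf{X}_2'\mathbf{X}_2 - \mathbf{X}_2'\mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{X}_2]\boldsymbol{\beta}_2/h$
as in Table 8.3.

8.12 Find the expected mean square corresponding to the numerator of the $F$
statistic in (8.20) in Example 8.2b.

8.13 Show that $\hat\beta_0^* = \bar y$ and $\mathrm{SS}(\beta_0) = n\bar y^2$, as in (8.21) in Example 8.2c.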


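8.14 In the proof of Theorem 8.2d, show that $(\boldsymbol{\beta}_1'\mathbf{X}_1' + \boldsymbol{\beta}_2'\mathbf{X}_2')(\mathbf{X}_1\boldsymbol{\beta}_1 + \mathbf{X}_2\boldsymbol{\beta}_2) -
(\boldsymbol{\beta}_1' + \boldsymbol{\beta}_2'\mathbf{A}')\mathbf{X}_1'\mathbf{X}_1(\boldsymbol{\beta}_1 + \mathbf{A}\boldsymbol{\beta}_2) = \boldsymbol{\beta}_2'[\mathbf{X}_2'\mathbf{X}_2 - \mathbf{X}_2'\mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{X}_2]\boldsymbol{\beta}_2$.

8.15 Express the test for $H_0\colon \boldsymbol{\beta}_2 = \mathbf{0}$ in terms of $R^2$, as in (8.25) in Theorem 8.3.

8.16 Prove Theorem 8.4a(iv).

8.17 Show that $\mathbf{C}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{C}'$ is positive definite, as noted following Theorem 8.4b.

8.18 Prove Theorem 8.4c.

8.19 Show that in the model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ subject to $\mathbf{C}\boldsymbol{\beta} = \mathbf{0}$ in (8.29), the estimator
of $\boldsymbol{\beta}$ is $\hat{\boldsymbol{\beta}}_c = \hat{\boldsymbol{\beta}} - (\mathbf{X}'\mathbf{X})^{-1}\mathbf{C}'[\mathbf{C}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{C}']^{-1}\mathbf{C}\hat{\boldsymbol{\beta}}$ as in (8.30), where
$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$. Use a Lagrange multiplier $\boldsymbol{\lambda}$ and minimize
$u = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + \boldsymbol{\lambda}'(\mathbf{C}\boldsymbol{\beta} - \mathbf{0})$ with respect to $\boldsymbol{\beta}$ and $\boldsymbol{\lambda}$ as follows:
(a) Differentiate $u$ with respect to $\boldsymbol{\lambda}$ and set the result equal to $\mathbf{0}$ to obtain
$\mathbf{C}\hat{\boldsymbol{\beta}}_c = \mathbf{0}$.
(b) Differentiate $u$ with respect to $\boldsymbol{\beta}$ and set the result equal to $\mathbf{0}$ to obtain

$$\hat{\boldsymbol{\beta}}_c = \hat{\boldsymbol{\beta}} - \tfrac{1}{2}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{C}'\boldsymbol{\lambda}, \tag{1}$$

where $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$.
(c) Multiply (1) in part (b) by $\mathbf{C}$, use $\mathbf{C}\hat{\boldsymbol{\beta}}_c = \mathbf{0}$ from part (a), solve for $\boldsymbol{\lambda}$, and
substitute back into (1).

8.20 Show that $\hat{\boldsymbol{\beta}}_c'\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}}_c = \hat{\boldsymbol{\beta}}_c'\mathbf{X}'\mathbf{y}$, thus demonstrating directly that the sum of
squares due to the reduced model is $\hat{\boldsymbol{\beta}}_c'\mathbf{X}'\mathbf{y}$ and that (8.31) holds.

8.21 Show that for the general linear hypothesis $H_0\colon \mathbf{C}\boldsymbol{\beta} = \mathbf{0}$ in Theorem 8.4d, we
have $\hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} - \hat{\boldsymbol{\beta}}_c'\mathbf{X}'\mathbf{y} = (\mathbf{C}\hat{\boldsymbol{\beta}})'[\mathbf{C}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{C}']^{-1}\mathbf{C}\hat{\boldsymbol{\beta}}$ as in (8.32), where $\hat{\boldsymbol{\beta}}_c$ is as
given in (8.30).

8.22 Prove Theorem 8.4e.

8.23 Prove Theorem 8.4f(iv) by expressing SSH and SSE as quadratic forms in the
same normally distributed random vector.

8.24 Show that the estimator for $\boldsymbol{\beta}$ in the reduced model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ subject to
$\mathbf{C}\boldsymbol{\beta} = \mathbf{t}$ is given by $\hat{\boldsymbol{\beta}}_c = \hat{\boldsymbol{\beta}} - (\mathbf{X}'\mathbf{X})^{-1}\mathbf{C}'[\mathbf{C}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{C}']^{-1}(\mathbf{C}\hat{\boldsymbol{\beta}} - \mathbf{t})$, where
$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$.

8.25 Show that $\hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} - \hat{\boldsymbol{\beta}}_1^{*\prime}\mathbf{X}_1'\mathbf{y}$ in (8.37) is equal to $\hat\beta_k^2/g_{kk}$ in (8.39) (for $j = k$), as
noted below (8.39).

8.26 Obtain the confidence interval for $\mathbf{a}'\boldsymbol{\beta}$ in (8.49) from the $t$ statistic in (8.48).

8.27 Show that the confidence interval for $\mathbf{x}_0'\boldsymbol{\beta}$ in (8.52) is the same as that for the
centered model in (8.55).

8.28 Show that the confidence interval for $\beta_0 + \beta_1 x_0$ in (8.58) follows from (8.55).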


8.29 Show that $t = (y_0 - \hat y_0)/s\sqrt{1 + \mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0}$ in (8.60) is distributed as
$t(n - k - 1)$.

8.30 (a) Given that $\bar y_0 = \sum_{i=1}^q y_{0i}/q$ is the mean of $q$ future observations at $\mathbf{x}_0$,
show that a $100(1 - \alpha)\%$ prediction interval for $\bar y_0$ is given by

$$\mathbf{x}_0'\hat{\boldsymbol{\beta}} \pm t_{\alpha/2,\,n-k-1}\,s\sqrt{1/q + \mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0}.$$

(b) Show that for simple linear regression, the prediction interval for $\bar y_0$ in part (a)
reduces to

$$\hat\beta_0 + \hat\beta_1 x_0 \pm t_{\alpha/2,\,n-2}\,s\sqrt{1/q + 1/n + (x_0 - \bar x)^2\Big/\sum_{i=1}^n (x_i - \bar x)^2}.$$

8.31 Obtain the confidence interval for $\sigma^2$ in (8.65) from the probability statement
in (8.64).

8.32 Show that the Scheffé prediction intervals for $d$ future observations are given
by (8.73).

8.33 Verify (8.76)-(8.79) in the proof of Theorem 8.7a.

8.34 Verify (8.80), $\partial v/\partial\boldsymbol{\beta} = (2\mathbf{X}'\mathbf{y} - 2\mathbf{X}'\mathbf{X}\boldsymbol{\beta})/2\sigma^2 + \mathbf{C}'\boldsymbol{\lambda}$.

8.35 Show that the solution to (8.80)-(8.82) is given by $\hat{\boldsymbol{\beta}}_0$ and $\hat\sigma_0^2$ in (8.83) and
(8.84).

8.36 Show that $(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}_0)'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}_0) = n\hat\sigma^2 + (\mathbf{C}\hat{\boldsymbol{\beta}})'[\mathbf{C}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{C}']^{-1}\mathbf{C}\hat{\boldsymbol{\beta}}$ as in
(8.85).

8.37 Use the gas vapor data in Table 7.3.
(a) Test the overall regression hypothesis $H_0\colon \boldsymbol{\beta}_1 = \mathbf{0}$ using (8.5) [or (8.22)]
and (8.23).
(b) Test $H_0\colon \beta_1 = \beta_3 = 0$, that is, that $x_1$ and $x_3$ do not significantly
contribute above and beyond $x_2$ and $x_4$.
(c) Test $H_0\colon \beta_j = 0$ for $j = 1, 2, 3, 4$ using $t_j$ in (8.40). Use $t_{.05/2}$ for each test
and also use a Bonferroni approach based on $t_{.05/8}$ (or compare the p value
to .05/4).
(d) Using general linear hypothesis tests, test $H_0\colon \beta_1 = \beta_2 = 12\beta_3 = 12\beta_4$,
$H_{01}\colon \beta_1 = \beta_2$, $H_{02}\colon \beta_2 = 12\beta_3$, $H_{03}\colon \beta_3 = \beta_4$, and $H_{04}\colon \beta_1 = \beta_2$ and $\beta_3 = \beta_4$.
(e) Find confidence intervals for $\beta_1$, $\beta_2$, $\beta_3$, and $\beta_4$ using both (8.47) and (8.67).

8.38 Use the land rent data in Table 7.5.
(a) Test the overall regression hypothesis $H_0\colon \boldsymbol{\beta}_1 = \mathbf{0}$ using (8.5) [or (8.22)]
and (8.23).
(b) Test $H_0\colon \beta_j = 0$ for $j = 1, 2, 3$ using $t_j$ in (8.40). Use $t_{.05/2}$ for each test and
also use a Bonferroni approach based on $t_{.05/6}$ (or compare the p value to
.05/3).
(c) Find confidence intervals for $\beta_1$, $\beta_2$, $\beta_3$ using both (8.47) and (8.67).
(d) Using (8.52), find a 95% confidence interval for $E(y_0) = \mathbf{x}_0'\boldsymbol{\beta}$, where
$\mathbf{x}_0' = (1, 15, 30, .5)$.


(e) Using (8.61), find a 95% prediction interval for $y_0 = \mathbf{x}_0'\boldsymbol{\beta} + \varepsilon$, where
$\mathbf{x}_0' = (1, 15, 30, .5)$.

8.39 Use $y_2$ in the chemical reaction data in Table 7.4.
(a) Using (8.52), find a 95% confidence interval for $E(y_0) = \mathbf{x}_0'\boldsymbol{\beta}$, where
$\mathbf{x}_0' = (1, 165, 32, 5)$.
(b) Using (8.61), find a 95% prediction interval for $y_0 = \mathbf{x}_0'\boldsymbol{\beta} + \varepsilon$, where
$\mathbf{x}_0' = (1, 165, 32, 5)$.
(c) Test $H_0\colon 2\beta_1 = 2\beta_2 = \beta_3$ using (8.27). (This was done for $y_1$ in Example
8.4b.)

8.40 Use $y_1$ in the chemical reaction data in Table 7.4. The full model with
second-order terms and the reduced model with only linear terms were fit in Problem
7.52.
(a) Test $H_0\colon \beta_4 = \beta_5 = \cdots = \beta_9 = 0$, that is, that the second-order terms are
not useful in predicting $y_1$. (This was done for $y_2$ in Example 8.2a.)
(b) Test the significance of the increase in $R^2$ from the reduced model to the
full model. (This was done for $y_2$ in Example 8.3. See Problem 7.52 for
values of $R^2$.)
(c) Find a 95% confidence interval for each of $\beta_0$, $\beta_1$, $\beta_2$, $\beta_3$ using (8.47).
(d) Find Bonferroni confidence intervals for $\beta_1$, $\beta_2$, $\beta_3$ using (8.67).
(e) Using (8.52), find a 95% confidence interval for $E(y_0) = \mathbf{x}_0'\boldsymbol{\beta}$, where
$\mathbf{x}_0' = (1, 165, 32, 5)$.
(f) Using (8.61), find a 95% prediction interval for $y_0 = \mathbf{x}_0'\boldsymbol{\beta} + \varepsilon$, where
$\mathbf{x}_0' = (1, 165, 32, 5)$.


9 Multiple Regression: Model 
Validation and Diagnostics 


In Sections 7.8.2 and 7.9 we discussed some consequences of misspecification of the 
model. In this chapter we consider various approaches to checking the model and the 
attendant assumptions for adequacy and validity. Some properties of the residuals 
[see (7.11)] and the hat matrix are developed in Sections 9.1 and 9.2. We discuss out- 
liers, the influence of individual observations, and leverage in Sections 9.3 and 9.4. 

For additional reading, see Snee (1977), Cook (1977), Belsley et al. (1980), Draper 
and Smith (1981, Chapter 6), Cook and Weisberg (1982), Beckman and Cook 
(1983), Weisberg (1985, Chapters 5, 6), Chatterjee and Hadi (1988), Myers (1990, 
Chapters 5-8), Sen and Srivastava (1990, Chapter 8), Montgomery and Peck 
(1992, pp. 67-113, 159-192), Jørgensen (1993, Chapter 5), Graybill and Iyer
(1994, Chapter 5), Hocking (1996, Chapter 9), Christensen (1996, Chapter 13),
Ryan (1997, Chapters 2, 5), Fox (1997, Chapters 11-13) and Kutner et al. (2005,
Chapter 10). 


9.1 RESIDUALS 


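The usual model is given by (7.4) as $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ with assumptions $E(\boldsymbol{\varepsilon}) = \mathbf{0}$
and $\operatorname{cov}(\boldsymbol{\varepsilon}) = \sigma^2\mathbf{I}$, where $\mathbf{y}$ is $n \times 1$, $\mathbf{X}$ is $n \times (k + 1)$ of rank $k + 1 < n$, and $\boldsymbol{\beta}$ is
$(k + 1) \times 1$. The error vector $\boldsymbol{\varepsilon}$ is unobservable unless $\boldsymbol{\beta}$ is known. To estimate $\boldsymbol{\varepsilon}$
for a given sample, we use the residual vector

$$\hat{\boldsymbol{\varepsilon}} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{y} - \hat{\mathbf{y}} \tag{9.1}$$

as defined in (7.11). The $n$ residuals in (9.1), $\hat\varepsilon_1, \hat\varepsilon_2, \ldots, \hat\varepsilon_n$, are used in various plots
and procedures for checking on the validity or adequacy of the model.

We first consider some properties of the residual vector $\hat{\boldsymbol{\varepsilon}}$. Using the least-squares
estimator $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$ in (7.6), the vector of predicted values $\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}$ can be
written as

$$\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} = \mathbf{H}\mathbf{y}, \tag{9.2}$$

where $\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$ (see Section 8.2). The $n \times n$ matrix $\mathbf{H}$ is called the hat
matrix because it transforms $\mathbf{y}$ to $\hat{\mathbf{y}}$. We also refer to $\mathbf{H}$ as a projection matrix for
essentially the same reason; geometrically it projects $\mathbf{y}$ (perpendicularly) onto $\hat{\mathbf{y}}$
(see Fig. 7.4). The hat matrix $\mathbf{H}$ is symmetric and idempotent (see Problem 5.32a).
Multiplying $\mathbf{X}$ by $\mathbf{H}$, we obtain

$$\mathbf{H}\mathbf{X} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X} = \mathbf{X}. \tag{9.3}$$

Writing $\mathbf{X}$ in terms of its columns and using (2.28), we can write (9.3) as

$$\mathbf{H}\mathbf{X} = \mathbf{H}(\mathbf{j}, \mathbf{x}_1, \ldots, \mathbf{x}_k) = (\mathbf{H}\mathbf{j}, \mathbf{H}\mathbf{x}_1, \ldots, \mathbf{H}\mathbf{x}_k),$$

so that

$$\mathbf{j} = \mathbf{H}\mathbf{j}, \qquad \mathbf{x}_i = \mathbf{H}\mathbf{x}_i, \quad i = 1, 2, \ldots, k. \tag{9.4}$$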
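
The three properties just stated (symmetry, idempotence, and $\mathbf{H}\mathbf{X} = \mathbf{X}$) are easy to confirm numerically. The sketch below builds $\mathbf{H}$ for a small made-up design with an intercept and one predictor, using plain Python lists; the predictor values and the helper `matmul` are illustrative assumptions, not part of the text.

```python
# Hypothetical design: n = 5 observations, one predictor plus an intercept.
x = [1.0, 2.0, 4.0, 7.0, 11.0]
n = len(x)
X = [[1.0, xi] for xi in x]

# (X'X)^{-1} for the 2 x 2 case, written out explicitly.
sx, sxx = sum(x), sum(xi * xi for xi in x)
det = n * sxx - sx * sx
xtx_inv = [[sxx / det, -sx / det], [-sx / det, n / det]]


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a[i][r] * b[r][j] for r in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


Xt = [list(col) for col in zip(*X)]
H = matmul(matmul(X, xtx_inv), Xt)  # H = X (X'X)^{-1} X'
HX = matmul(H, X)                   # should reproduce X, as in (9.3)
HH = matmul(H, H)                   # should reproduce H (idempotence)

max_err_hx = max(abs(HX[i][j] - X[i][j]) for i in range(n) for j in range(2))
max_err_hh = max(abs(HH[i][j] - H[i][j]) for i in range(n) for j in range(n))
max_asym = max(abs(H[i][j] - H[j][i]) for i in range(n) for j in range(n))
print(max_err_hx, max_err_hh, max_asym)
```

All three printed discrepancies should be at the level of floating-point rounding error.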


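Using (9.2), the residual vector $\hat{\boldsymbol{\varepsilon}}$ in (9.1) can be expressed in terms of $\mathbf{H}$:

$$\hat{\boldsymbol{\varepsilon}} = \mathbf{y} - \hat{\mathbf{y}} = \mathbf{y} - \mathbf{H}\mathbf{y} = (\mathbf{I} - \mathbf{H})\mathbf{y}. \tag{9.5}$$

We can rewrite (9.5) to express the residual vector $\hat{\boldsymbol{\varepsilon}}$ in terms of $\boldsymbol{\varepsilon}$:

$$\begin{aligned}
\hat{\boldsymbol{\varepsilon}} &= (\mathbf{I} - \mathbf{H})\mathbf{y} = (\mathbf{I} - \mathbf{H})(\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}) \\
&= (\mathbf{X}\boldsymbol{\beta} - \mathbf{H}\mathbf{X}\boldsymbol{\beta}) + (\mathbf{I} - \mathbf{H})\boldsymbol{\varepsilon} \\
&= (\mathbf{X}\boldsymbol{\beta} - \mathbf{X}\boldsymbol{\beta}) + (\mathbf{I} - \mathbf{H})\boldsymbol{\varepsilon} \qquad [\text{by (9.3)}] \\
&= (\mathbf{I} - \mathbf{H})\boldsymbol{\varepsilon}.
\end{aligned} \tag{9.6}$$

In terms of the elements $h_{ij}$ of $\mathbf{H}$, we have $\hat\varepsilon_i = \varepsilon_i - \sum_{j=1}^n h_{ij}\varepsilon_j$, $i = 1, 2, \ldots, n$. Thus,
if the $h_{ij}$'s are small (in absolute value), $\hat{\boldsymbol{\varepsilon}}$ is close to $\boldsymbol{\varepsilon}$.

The following are some of the properties of $\hat{\boldsymbol{\varepsilon}}$ (see Problem 9.1). For the first four,
we assume that $E(\mathbf{y}) = \mathbf{X}\boldsymbol{\beta}$ and $\operatorname{cov}(\mathbf{y}) = \sigma^2\mathbf{I}$:

$$E(\hat{\boldsymbol{\varepsilon}}) = \mathbf{0} \tag{9.7}$$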




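$$\operatorname{cov}(\hat{\boldsymbol{\varepsilon}}) = \sigma^2[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'] = \sigma^2(\mathbf{I} - \mathbf{H}) \tag{9.8}$$
$$\operatorname{cov}(\hat{\boldsymbol{\varepsilon}}, \mathbf{y}) = \sigma^2[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'] = \sigma^2(\mathbf{I} - \mathbf{H}) \tag{9.9}$$
$$\operatorname{cov}(\hat{\boldsymbol{\varepsilon}}, \hat{\mathbf{y}}) = \mathbf{O} \tag{9.10}$$
$$\bar{\hat\varepsilon} = \sum_{i=1}^n \hat\varepsilon_i/n = \hat{\boldsymbol{\varepsilon}}'\mathbf{j}/n = 0 \tag{9.11}$$
$$\hat{\boldsymbol{\varepsilon}}'\mathbf{y} = \mathrm{SSE} = \mathbf{y}'[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']\mathbf{y} = \mathbf{y}'(\mathbf{I} - \mathbf{H})\mathbf{y} \tag{9.12}$$
$$\hat{\boldsymbol{\varepsilon}}'\hat{\mathbf{y}} = 0 \tag{9.13}$$
$$\hat{\boldsymbol{\varepsilon}}'\mathbf{X} = \mathbf{0}' \tag{9.14}$$

In (9.7), the residual vector $\hat{\boldsymbol{\varepsilon}}$ has the same mean as the error term $\boldsymbol{\varepsilon}$, but in (9.8)
$\operatorname{cov}(\hat{\boldsymbol{\varepsilon}}) = \sigma^2(\mathbf{I} - \mathbf{H})$ differs from the assumption $\operatorname{cov}(\boldsymbol{\varepsilon}) = \sigma^2\mathbf{I}$. Thus the residuals
$\hat\varepsilon_1, \hat\varepsilon_2, \ldots, \hat\varepsilon_n$ are not independent. However, in many cases, especially if $n$ is large,
the $h_{ij}$'s tend to be small (for $i \ne j$), and the dependence shown in $\sigma^2(\mathbf{I} - \mathbf{H})$ does
not unduly affect plots and other techniques for model validation. Each $\hat\varepsilon_i$ is seen
to be correlated with each $y_j$ in (9.9), but in (9.10) the $\hat\varepsilon_i$'s are uncorrelated with
the $\hat y_j$'s.

Some sample properties of the residuals are given in (9.11)-(9.14). The sample
mean of the residuals is zero, as shown in (9.11). By (9.12), it can be seen that $\hat{\boldsymbol{\varepsilon}}$
and $\mathbf{y}$ are correlated in the sample since $\hat{\boldsymbol{\varepsilon}}'\mathbf{y}$ is the numerator of

$$r_{\hat\varepsilon y} = \frac{\hat{\boldsymbol{\varepsilon}}'(\mathbf{y} - \bar y\mathbf{j})}{\sqrt{(\hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}})(\mathbf{y} - \bar y\mathbf{j})'(\mathbf{y} - \bar y\mathbf{j})}} = \frac{\hat{\boldsymbol{\varepsilon}}'\mathbf{y}}{\sqrt{(\hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}})(\mathbf{y} - \bar y\mathbf{j})'(\mathbf{y} - \bar y\mathbf{j})}}.$$

However, $\hat{\boldsymbol{\varepsilon}}$ and $\hat{\mathbf{y}}$ are orthogonal by (9.13), and therefore

$$r_{\hat\varepsilon\hat y} = 0. \tag{9.15}$$

Similarly, by (9.14), $\hat{\boldsymbol{\varepsilon}}$ is orthogonal to each column of $\mathbf{X}$ and

$$r_{\hat\varepsilon x_i} = 0, \qquad i = 1, 2, \ldots, k. \tag{9.16}$$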
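
The sample properties (9.11)-(9.14) hold for any data set fitted by least squares with an intercept, which can be illustrated with a few lines of arithmetic. The data below are made up for the purpose (they do not come from the text), and the fit uses the simple linear regression formulas.

```python
# Hypothetical data; any response values would do for these identities.
x = [1.0, 2.0, 4.0, 7.0, 11.0]
y = [2.1, 2.9, 5.2, 7.8, 12.5]
n = len(y)

x_bar, y_bar = sum(x) / n, sum(y) / n
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar

y_hat = [b0 + b1 * xi for xi in x]
resid = [yi - fi for yi, fi in zip(y, y_hat)]

sum_resid = sum(resid)                                       # (9.11): zero
resid_dot_fitted = sum(e * f for e, f in zip(resid, y_hat))  # (9.13): zero
resid_dot_x = sum(e * xi for e, xi in zip(resid, x))         # (9.14): zero
resid_dot_y = sum(e * yi for e, yi in zip(resid, y))         # (9.12): equals SSE
sse = sum(e * e for e in resid)
print(sum_resid, resid_dot_fitted, resid_dot_x, resid_dot_y, sse)
```

The first three quantities vanish up to rounding error, while the inner product of the residuals with $\mathbf{y}$ equals SSE and is therefore positive whenever the fit is not perfect.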




Figure 9.1 Ideal residual plot when model is correct. 


If the model and attendant assumptions are correct, then by (9.15), a plot of the
residuals versus predicted values, $(\hat\varepsilon_1, \hat y_1), (\hat\varepsilon_2, \hat y_2), \ldots, (\hat\varepsilon_n, \hat y_n)$, should show no
systematic pattern. Likewise, by (9.16), the $k$ plots of the residuals versus each of
$x_1, x_2, \ldots, x_k$ should show only random variation. These plots are therefore useful
for checking the model. A typical plot of this type is shown in Figure 9.1. It may
also be useful to plot the residuals on normal probability paper and to plot residuals
in time sequence (Christensen 1996, Section 13.2).

If the model is incorrect, various plots involving residuals may show departures
from the fitted model such as outliers, curvature, or nonconstant variance. The
plots may also suggest remedial measures to improve the fit of the model. For
example, the residuals could be plotted versus any of the $x_i$'s, and a simple curved
pattern might suggest the addition of $x_i^2$ to the model. We will consider various
approaches for detecting outliers in Section 9.3 and for finding influential
observations in Section 9.4. Before doing so, we discuss some properties of the hat
matrix in Section 9.2.


9.2 THE HAT MATRIX 


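It was noted following (9.2) that the hat matrix $\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$ is symmetric and
idempotent. We now present some additional properties of this matrix. These
properties will be useful in the discussion of outliers and influential observations in
Sections 9.3 and 9.4.

For the centered model

$$\mathbf{y} = \alpha\mathbf{j} + \mathbf{X}_c\boldsymbol{\beta}_1 + \boldsymbol{\varepsilon} \tag{9.17}$$

in (7.32), $\hat{\mathbf{y}}$ becomes

$$\hat{\mathbf{y}} = \hat\alpha\mathbf{j} + \mathbf{X}_c\hat{\boldsymbol{\beta}}_1, \tag{9.18}$$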


9.2 THE HAT MATRIX 231 
and the hat matrix is H, = X.(X!.X.)-'X’., where 


Xp — Xp X12 — XQ 


Xk — XK 
1 Xai —Xy X22 —X2 +++ Xa, — Xe 
X= (I--J)Xi = 3 . : 
n . 
Xnt —X,  Xn2 — X2 Xnk — Xk 


By (7.36) and (7.37), we can write (9.18) as 
ae 2 Lie \ a 
¥ =) + XXX.) Xy = (tiy)i + Hey 


= (‘3 oe H.)y. (9.19) 
n 


Comparing (9.19) and (9.2), we have 


1 1 
H=_J+H, =—J+ X,(X'.X,) |X’. (9.20) 
n n 


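We now examine some properties of the elements $h_{ij}$ of $\mathbf{H}$.


Theorem 9.2. If $\mathbf{X}$ is $n \times (k + 1)$ of rank $k + 1 < n$, and if the first column of $\mathbf{X}$ is $\mathbf{j}$,
then the elements $h_{ij}$ of $\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$ have the following properties:

(i) $(1/n) \le h_{ii} \le 1$ for $i = 1, 2, \ldots, n$.
(ii) $-.5 \le h_{ij} \le .5$ for all $j \ne i$.
(iii) $h_{ii} = (1/n) + (\mathbf{x}_{1i} - \bar{\mathbf{x}}_1)'(\mathbf{X}_c'\mathbf{X}_c)^{-1}(\mathbf{x}_{1i} - \bar{\mathbf{x}}_1)$, where $\mathbf{x}_{1i}' = (x_{i1}, x_{i2}, \ldots, x_{ik})$,
$\bar{\mathbf{x}}_1' = (\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_k)$, and $(\mathbf{x}_{1i} - \bar{\mathbf{x}}_1)'$ is the $i$th row of the centered matrix $\mathbf{X}_c$.
(iv) $\mathrm{tr}(\mathbf{H}) = \sum_{i=1}^n h_{ii} = k + 1$.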


PROOF

(i) The lower bound follows from (9.20), since $\mathbf{X}_c'\mathbf{X}_c$ is positive definite. Since $\mathbf{H}$
is symmetric and idempotent, we use the relationship $\mathbf{H} = \mathbf{H}^2$ to find an upper
bound on $h_{ii}$. Let $\mathbf{h}_i'$ be the $i$th row of $\mathbf{H}$. Then

$$h_{ii} = \mathbf{h}_i'\mathbf{h}_i = (h_{i1}, h_{i2}, \ldots, h_{in})\begin{pmatrix} h_{i1} \\ h_{i2} \\ \vdots \\ h_{in} \end{pmatrix} = \sum_{j=1}^n h_{ij}^2 = h_{ii}^2 + \sum_{j \ne i} h_{ij}^2. \tag{9.21}$$

Dividing both sides of (9.21) by $h_{ii}$ [which is positive since $h_{ii} \ge (1/n)$], we
obtain

$$1 = h_{ii} + \frac{\sum_{j \ne i} h_{ij}^2}{h_{ii}}, \tag{9.22}$$

which implies $h_{ii} \le 1$.

(ii) (Chatterjee and Hadi 1988, p. 18.) We can write (9.21) in the form

$$h_{ii} = h_{ii}^2 + h_{ij}^2 + \sum_{r \ne i,j} h_{ir}^2$$

or

$$h_{ii} - h_{ii}^2 = h_{ij}^2 + \sum_{r \ne i,j} h_{ir}^2.$$

Thus, $h_{ij}^2 \le h_{ii} - h_{ii}^2$, and since the maximum value of $h_{ii} - h_{ii}^2$ is $\frac{1}{4}$, we have
$h_{ij}^2 \le \frac{1}{4}$ for $j \ne i$.

(iii) This follows from (9.20); see Problem 9.2b.

(iv) See Problem 9.2c.


By Theorem 9.2(iv), we see that as $n$ increases, the values of $h_{ii}$ will tend to decrease.

The function $(\mathbf{x}_{1i} - \bar{\mathbf{x}}_1)'(\mathbf{X}_c'\mathbf{X}_c)^{-1}(\mathbf{x}_{1i} - \bar{\mathbf{x}}_1)$ in Theorem 9.2(iii) is a standardized
distance. The standardized distance (Mahalanobis distance) defined in (3.27) is for
a population covariance matrix. The matrix $\mathbf{X}_c'\mathbf{X}_c$ is proportional to a sample covariance
matrix [see (7.44)]. Thus, $(\mathbf{x}_{1i} - \bar{\mathbf{x}}_1)'(\mathbf{X}_c'\mathbf{X}_c)^{-1}(\mathbf{x}_{1i} - \bar{\mathbf{x}}_1)$ is an estimated standardized
distance and provides a good measure of the relative distance of each $\mathbf{x}_{1i}$ from
the center of the points as represented by $\bar{\mathbf{x}}_1$.
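The properties in Theorem 9.2 and the decomposition (9.20) can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the predictor values are simulated and the names (`X1`, `Hc`, the seed, the sizes) are illustrative assumptions rather than anything taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 12, 3
X1 = rng.normal(size=(n, k))              # hypothetical predictor values
X = np.column_stack([np.ones(n), X1])     # first column of X is j

H = X @ np.linalg.inv(X.T @ X) @ X.T      # hat matrix H = X (X'X)^{-1} X'

Xc = X1 - X1.mean(axis=0)                 # centered matrix X_c
Hc = Xc @ np.linalg.inv(Xc.T @ Xc) @ Xc.T
J = np.ones((n, n))

h = np.diag(H)                            # leverages h_ii
off = H[~np.eye(n, dtype=bool)]           # off-diagonal elements h_ij

decomposition_ok = bool(np.allclose(H, J / n + Hc))            # (9.20), hence (iii)
bounds_ok = bool(np.all(h >= 1 / n - 1e-12) and np.all(h <= 1 + 1e-12))  # (i)
offdiag_ok = bool(np.all(np.abs(off) <= 0.5 + 1e-12))          # (ii)
trace_ok = bool(np.isclose(np.trace(H), k + 1))                # (iv)
print(decomposition_ok, bounds_ok, offdiag_ok, trace_ok)
```

Since the diagonal of `J / n + Hc` is exactly the expression in Theorem 9.2(iii), the first check covers that property as well.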


9.3 OUTLIERS


In some cases, the model appears to be correct for most of the data, but one residual is 
much larger (in absolute value) than the others. Such an outlier may be due to an error 
in recording or may be from another population or may simply be an unusual
observation from the assumed distribution. For example, if the errors $\varepsilon_i$ are distributed as
$N(0, \sigma^2)$, a value of $\varepsilon_i$ greater than $3\sigma$ or less than $-3\sigma$ would occur with frequency
.0027.
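The frequency .0027 can be reproduced from the standard normal distribution. A small sketch using only the Python standard library (the variable name `tail_prob` is an illustrative choice):

```python
import math

# P(|Z| > 3) for Z ~ N(0, 1), written with the complementary error function:
# P(|Z| > z) = erfc(z / sqrt(2)).
tail_prob = math.erfc(3 / math.sqrt(2))
print(round(tail_prob, 4))  # 0.0027
```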

If no explanation for an apparent outlier can be found, the dataset could be ana- 
lyzed both with and without the outlying observation. If the results differ sufficiently 
to affect the conclusions, then both analyses could be maintained until additional data 
become available. Another alternative is to discard the outlier, even though no
explanation has been found. A third possibility is to use robust methods that accommodate
the outlying observation (Huber 1973, Andrews 1974, Hampel 1974, Welsch 1975,
Devlin et al. 1975, Mosteller and Tukey 1977, Birch 1980, Krasker and Welsch
1982).

One approach to checking for outliers is to plot the residuals $\hat{\varepsilon}_i$ versus $\hat{y}_i$ or versus
$i$, the observation number. In our examination of residuals, we need to keep in mind
that by (9.8), the variance of the residuals is not constant:

$$\mathrm{var}(\hat{\varepsilon}_i) = \sigma^2(1 - h_{ii}). \tag{9.23}$$

By Theorem 9.2(i), $h_{ii} \le 1$; hence, $\mathrm{var}(\hat{\varepsilon}_i)$ will be small if $h_{ii}$ is near 1. By Theorem
9.2(iii), $h_{ii}$ will be large if $\mathbf{x}_{1i}$ is far from $\bar{\mathbf{x}}_1$, where $\mathbf{x}_{1i} = (x_{i1}, x_{i2}, \ldots, x_{ik})'$ and
$\bar{\mathbf{x}}_1 = (\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_k)'$. By (9.23), such observations will tend to have small residuals,
which seems unfortunate because the model is less likely to hold far from $\bar{\mathbf{x}}_1$. A small
residual at a point where $\mathbf{x}_{1i}$ is far from $\bar{\mathbf{x}}_1$ may result because the fitted model will
tend to pass close to a point isolated from the bulk of the points, with a resulting
poorer fit to the bulk of the data. This may mask an inadequacy of the true model
in the region of $\mathbf{x}_{1i}$.

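An additional verification that large values of $h_{ii}$ are accompanied by small
residuals is provided by the following inequality (see Problem 9.4):

$$\frac{1}{n} \le h_{ii} + \frac{\hat{\varepsilon}_i^2}{\hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}}} \le 1. \tag{9.24}$$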


For the reasons implicit in (9.23) and (9.24), it is desirable to scale the residuals so 
that they have the same variance. There are two common (and related) methods of 
scaling. 

For the first method of scaling, we use $\mathrm{var}(\hat{\varepsilon}_i) = \sigma^2(1 - h_{ii})$ in (9.23) to obtain the
standardized residuals $\hat{\varepsilon}_i/(\sigma\sqrt{1 - h_{ii}})$, which have mean 0 and variance 1. Replacing $\sigma$
by $s$ yields the studentized residual

$$r_i = \frac{\hat{\varepsilon}_i}{s\sqrt{1 - h_{ii}}}, \tag{9.25}$$

where $s^2 = \mathrm{SSE}/(n - k - 1)$ is as defined in (7.24). The use of $r_i$ in place of $\hat{\varepsilon}_i$
eliminates the location effect (due to $h_{ii}$) on the size of residuals, as discussed
following (9.23).

A second method of scaling the residuals uses an estimate of $\sigma$ that excludes the $i$th
observation

$$t_i = \frac{\hat{\varepsilon}_i}{s_{(i)}\sqrt{1 - h_{ii}}}, \tag{9.26}$$

where $s_{(i)}$ is the standard error computed with the $n - 1$ observations remaining after
omitting $(y_i, \mathbf{x}_i') = (y_i, x_{i1}, \ldots, x_{ik})$, in which $y_i$ is the $i$th element of $\mathbf{y}$ and $\mathbf{x}_i'$ is the $i$th
row of $\mathbf{X}$. If the $i$th observation is an outlier, it will more likely show up as such with
the standardization in (9.26), which is called the externally studentized residual or the
studentized deleted residual or R student.

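Another option is to examine the deleted residuals. The $i$th deleted residual, $\hat{\varepsilon}_{(i)}$, is
computed with $\hat{\boldsymbol{\beta}}_{(i)}$ on the basis of $n - 1$ observations with $(y_i, \mathbf{x}_i')$ deleted:

$$\hat{\varepsilon}_{(i)} = y_i - \hat{y}_{(i)} = y_i - \mathbf{x}_i'\hat{\boldsymbol{\beta}}_{(i)}. \tag{9.27}$$

By definition

$$\hat{\boldsymbol{\beta}}_{(i)} = (\mathbf{X}_{(i)}'\mathbf{X}_{(i)})^{-1}\mathbf{X}_{(i)}'\mathbf{y}_{(i)}, \tag{9.28}$$

where $\mathbf{X}_{(i)}$ is the $(n - 1) \times (k + 1)$ matrix obtained by deleting $\mathbf{x}_i' = (1, x_{i1}, \ldots, x_{ik})$,
the $i$th row of $\mathbf{X}$, and $\mathbf{y}_{(i)}$ is the corresponding $(n - 1) \times 1$ $\mathbf{y}$ vector after deleting $y_i$.
The deleted vector $\hat{\boldsymbol{\beta}}_{(i)}$ can also be found without actually deleting $(y_i, \mathbf{x}_i')$ since

$$\hat{\boldsymbol{\beta}}_{(i)} = \hat{\boldsymbol{\beta}} - \frac{\hat{\varepsilon}_i}{1 - h_{ii}}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i \tag{9.29}$$

(see Problem 9.5).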
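The deleted residual $\hat{\varepsilon}_{(i)} = y_i - \mathbf{x}_i'\hat{\boldsymbol{\beta}}_{(i)}$ in (9.27) can be expressed in terms of $\hat{\varepsilon}_i$ and
$h_{ii}$ as

$$\hat{\varepsilon}_{(i)} = \frac{\hat{\varepsilon}_i}{1 - h_{ii}} \tag{9.30}$$

(see Problem 9.6). Thus the $n$ deleted residuals can be obtained without computing
$n$ regressions. The scaled residual $t_i$ in (9.26) can be expressed in terms of $\hat{\varepsilon}_{(i)}$ in
(9.30) as

$$t_i = \frac{\hat{\varepsilon}_{(i)}}{\sqrt{\widehat{\mathrm{var}}(\hat{\varepsilon}_{(i)})}} \tag{9.31}$$

(see Problem 9.7).

The deleted sample variance $s_{(i)}^2$ used in (9.26) is defined as $s_{(i)}^2 = \mathrm{SSE}_{(i)}/(n - k - 2)$,
where $\mathrm{SSE}_{(i)} = \mathbf{y}_{(i)}'\mathbf{y}_{(i)} - \hat{\boldsymbol{\beta}}_{(i)}'\mathbf{X}_{(i)}'\mathbf{y}_{(i)}$. This can be found without excluding
the $i$th observation as

$$s_{(i)}^2 = \frac{\mathrm{SSE}_{(i)}}{n - k - 2} = \frac{\mathrm{SSE} - \hat{\varepsilon}_i^2/(1 - h_{ii})}{n - k - 2} \tag{9.32}$$

(see Problem 9.8).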
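The shortcut formulas (9.29), (9.30), and (9.32) can be compared with a brute-force refit that actually drops one observation. This is only a sketch on simulated data, assuming NumPy; the data-generating coefficients, the seed, and the choice of the deleted index `i` are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 15, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)   # hypothetical data

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)   # diagonal of the hat matrix
sse = resid @ resid

i = 4                                         # observation to delete (0-based)

# Brute force: refit without observation i.
keep = np.arange(n) != i
Xi, yi = X[keep], y[keep]
beta_i = np.linalg.solve(Xi.T @ Xi, Xi.T @ yi)            # (9.28)
deleted_resid = y[i] - X[i] @ beta_i                      # (9.27)
sse_i = np.sum((yi - Xi @ beta_i) ** 2)

# Shortcuts that use only the full-data fit.
beta_i_short = beta - resid[i] / (1 - h[i]) * (XtX_inv @ X[i])   # (9.29)
deleted_short = resid[i] / (1 - h[i])                            # (9.30)
sse_i_short = sse - resid[i] ** 2 / (1 - h[i])                   # (9.32)

s2_i = sse_i_short / (n - k - 2)
t_i = resid[i] / np.sqrt(s2_i * (1 - h[i]))               # (9.26)
t_i_alt = deleted_short / np.sqrt(s2_i / (1 - h[i]))      # (9.31)
print(np.allclose(beta_i, beta_i_short), np.isclose(t_i, t_i_alt))
```

The practical point is the one made in the text: all $n$ deleted quantities follow from a single regression.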




Another option for outlier detection is to plot the ordinary residuals $\hat{\varepsilon}_i = y_i - \mathbf{x}_i'\hat{\boldsymbol{\beta}}$
against the deleted residuals $\hat{\varepsilon}_{(i)}$ in (9.27) or (9.30). If the fit does not change substantially
when the $i$th observation is deleted in computation of $\hat{\boldsymbol{\beta}}$, the plotted points
should approximately follow a straight line with a slope of 1. Any points that are relatively
far from this line are potential outliers.

If an outlier is from a distribution with a different mean, the model can be
expressed as $E(y_i) = \mathbf{x}_i'\boldsymbol{\beta} + \theta$, where $\mathbf{x}_i'$ is the $i$th row of $\mathbf{X}$. This is called the
mean-shift outlier model. The distribution of $t_i$ in (9.26) or (9.31) is $t(n - k - 2)$,
matching the $n - k - 2$ degrees of freedom of $s_{(i)}^2$ in (9.32),
and $t_i$ can therefore be used in a test of the hypothesis $H_0: \theta = 0$. Since $n$ tests
will be made, a Bonferroni adjustment to the critical values can be used, or we can
simply focus on the largest $|t_i|$ values.

The $n$ deleted residuals in (9.30) can be used for model validation or selection by
defining the prediction sum of squares (PRESS):

$$\mathrm{PRESS} = \sum_{i=1}^n \hat{\varepsilon}_{(i)}^2 = \sum_{i=1}^n \left(\frac{\hat{\varepsilon}_i}{1 - h_{ii}}\right)^2. \tag{9.33}$$

Thus, a residual $\hat{\varepsilon}_i$ that corresponds to a large value of $h_{ii}$ contributes more to PRESS.
For a given dataset, PRESS may be a better measure than SSE of how well the model
will predict future observations. To use PRESS to compare alternative models when
the objective is prediction, preference would be shown to models with small values of
PRESS.
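As an illustration of (9.33), PRESS computed from the leverages can be compared with an explicit leave-one-out loop. The sketch below assumes NumPy and uses a simulated simple linear regression; the data and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20
x = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x])
y = 3.0 + 0.5 * x + rng.normal(size=n)       # hypothetical data

H = X @ np.linalg.solve(X.T @ X, X.T)        # hat matrix
resid = y - H @ y
h = np.diag(H)

sse = float(resid @ resid)
press = float(np.sum((resid / (1 - h)) ** 2))    # (9.33), one fit only

# Leave-one-out version: n separate regressions.
press_loo = 0.0
for i in range(n):
    keep = np.arange(n) != i
    b = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    press_loo += float((y[i] - X[i] @ b) ** 2)

print(np.isclose(press, press_loo), press >= sse)
```

Because each residual is divided by $1 - h_{ii} \le 1$, PRESS can never be smaller than SSE.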


9.4 INFLUENTIAL OBSERVATIONS AND LEVERAGE 


In Section 9.3, we emphasized a search for outliers that did not fit the model. In
this section, we consider the effect that deletion of an observation $(y_i, \mathbf{x}_i')$ has on
the estimates $\hat{\boldsymbol{\beta}}$ and $\mathbf{X}\hat{\boldsymbol{\beta}}$. An observation that makes a major difference on these
estimates is called an influential observation. A point $(y_i, \mathbf{x}_i')$ is potentially influential
if it is an outlier in the $y$ direction or if it is unusually far removed from the center of
the $x$'s.

We illustrate influential observations for the case of one $x$ in Figure 9.2. Points 1
and 3 are extreme in the $x$ direction; points 2 and 3 would likely appear as outliers in
the $y$ direction. Even though point 1 is extreme in $x$, it will not unduly influence the
slope or intercept. Point 3 will have a dramatic influence on the slope and intercept
since the regression line would pass near point 3. Point 2 is also influential, but
much less so than point 3.

Thus, influential points are likely to be found in areas where little or no other data 
were collected. Such points may be fitted very well, sometimes to the detriment of the 
fit to the other data. 




Figure 9.2 Simple linear regression showing three outliers. 


To investigate the influence of each observation, we begin with $\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}$ in (9.2),
the elements of which are

$$\hat{y}_i = \sum_{j=1}^n h_{ij}y_j = h_{ii}y_i + \sum_{j \ne i} h_{ij}y_j. \tag{9.34}$$

By (9.22), if $h_{ii}$ is large (close to 1), then the $h_{ij}$'s, $j \ne i$, are all small, and $y_i$ contributes
much more than the other $y$'s to $\hat{y}_i$. Hence, $h_{ii}$ is called the leverage of $y_i$. Points
with high leverage have high potential for influencing regression results. In general, if
an observation $(y_i, \mathbf{x}_i')$ has a value of $h_{ii}$ near 1, then the estimated regression equation
will be close to $y_i$; that is, $\hat{y}_i - y_i$ will be small.

By Theorem 9.2(iv), the average value of the $h_{ii}$'s is $(k + 1)/n$. Hoaglin and
Welsch (1978) suggest that a point with $h_{ii} > 2(k + 1)/n$ is a high leverage point.
Alternatively, we can simply examine any observation whose value of $h_{ii}$ is unusually
large relative to the other values of $h_{ii}$.
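For simple linear regression ($k = 1$), Theorem 9.2(iii) reduces to $h_{ii} = 1/n + (x_i - \bar{x})^2/\sum_j (x_j - \bar{x})^2$, so the guideline can be tried without any matrix software. A minimal sketch with made-up $x$ values (the data and variable names are assumptions for illustration only):

```python
# Hypothetical x values; the last point lies far from the others.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 20.0]
n, k = len(x), 1

x_bar = sum(x) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)

# Theorem 9.2(iii) specialized to one predictor.
leverage = [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]

cutoff = 2 * (k + 1) / n    # guideline attributed to Hoaglin and Welsch (1978)
flagged = [i + 1 for i, h in enumerate(leverage) if h > cutoff]   # 1-based
print(flagged, round(sum(leverage), 6))
```

Only the isolated sixth point exceeds the cutoff, and the leverages sum to $k + 1 = 2$, as Theorem 9.2(iv) requires.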

In terms of fitting the model to the bulk of the data, high leverage points can be
either good or bad, as illustrated by points 1 and 3 in Figure 9.2. Point 1 may
reduce the variance of $\hat{\beta}_0$ and $\hat{\beta}_1$. On the other hand, point 3 will drastically alter
the fitted model. If point 3 is not the result of a recording error, then the researcher
must choose between two competing fitted models. Typically, the model that fits
the bulk of the data might be preferred until additional points can be observed in
other areas.

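To formalize the influence of a point $(y_i, \mathbf{x}_i')$, we consider the effect of its deletion
on $\hat{\boldsymbol{\beta}}$ and $\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}$. The estimate of $\boldsymbol{\beta}$ obtained by deleting the $i$th observation $(y_i, \mathbf{x}_i')$
is defined in (9.28) as $\hat{\boldsymbol{\beta}}_{(i)} = (\mathbf{X}_{(i)}'\mathbf{X}_{(i)})^{-1}\mathbf{X}_{(i)}'\mathbf{y}_{(i)}$. We can compare $\hat{\boldsymbol{\beta}}_{(i)}$ to $\hat{\boldsymbol{\beta}}$ by means
of Cook's distance, defined as

$$D_i = \frac{(\hat{\boldsymbol{\beta}}_{(i)} - \hat{\boldsymbol{\beta}})'\mathbf{X}'\mathbf{X}(\hat{\boldsymbol{\beta}}_{(i)} - \hat{\boldsymbol{\beta}})}{(k + 1)s^2}. \tag{9.35}$$

This can be rewritten as

$$D_i = \frac{(\mathbf{X}\hat{\boldsymbol{\beta}}_{(i)} - \mathbf{X}\hat{\boldsymbol{\beta}})'(\mathbf{X}\hat{\boldsymbol{\beta}}_{(i)} - \mathbf{X}\hat{\boldsymbol{\beta}})}{(k + 1)s^2} = \frac{(\hat{\mathbf{y}}_{(i)} - \hat{\mathbf{y}})'(\hat{\mathbf{y}}_{(i)} - \hat{\mathbf{y}})}{(k + 1)s^2}, \tag{9.36}$$

in which $D_i$ is proportional to the squared Euclidean distance between $\hat{\mathbf{y}}_{(i)}$ and $\hat{\mathbf{y}}$.
Thus if $D_i$ is large, the observation $(y_i, \mathbf{x}_i')$ has substantial influence on both $\hat{\boldsymbol{\beta}}$ and
$\hat{\mathbf{y}}$. A more computationally convenient form of $D_i$ is given by

$$D_i = \frac{r_i^2}{k + 1}\left(\frac{h_{ii}}{1 - h_{ii}}\right) \tag{9.37}$$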


TABLE 9.1 Residuals and Influence Measures for the Chemical Data
with Dependent Variable $y_1$

Observation   $y_i$   $\hat{y}_i$   $\hat{\varepsilon}_i$   $h_{ii}$   $r_i$    $t_i$    $D_i$
     1        41.5    42.19    -0.688   0.430   -0.394   -0.383   0.029
     2        33.8    31.00     2.798   0.310    1.457    1.520   0.239
     3        27.7    27.74    -0.042   0.155   -0.020   -0.019   0.000
     4        21.7    21.03     0.670   0.139    0.313    0.303   0.004
     5        19.9    19.40     0.495   0.129    0.230    0.222   0.002
     6        15.0    12.69     2.307   0.140    1.076    1.082   0.047
     7        12.2    12.28    -0.082   0.228   -0.040   -0.039   0.000
     8         4.3     5.57    -1.270   0.186   -0.609   -0.596   0.021
     9        19.3    20.22    -0.917   0.053   -0.408   -0.396   0.002
    10         6.4     4.76     1.642   0.233    0.811    0.801   0.050
    11        37.6    35.68     1.923   0.240    0.954    0.951   0.072
    12        18.0    13.09     4.906   0.164    2.320    2.800   0.264
    13        26.3    27.34    -1.040   0.146   -0.487   -0.474   0.010
    14         9.9    13.51    -3.605   0.245   -1.795   -1.956   0.261
    15        25.0    26.93    -1.929   0.250   -0.964   -0.961   0.077
    16        14.1    15.44    -1.342   0.258   -0.674   -0.661   0.039
    17        15.2    15.44    -0.242   0.258   -0.121   -0.117   0.001
    18        15.9    19.54    -3.642   0.217   -1.780   -1.937   0.220
    19        19.6    19.54     0.058   0.217    0.028    0.027   0.000




(see Problem 9.9). Muller and Mok (1997) discuss the distribution of D; and provide a 
table of critical values. 
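The agreement between the definition (9.35) and the convenient form (9.37) can be verified numerically. The following sketch assumes NumPy and simulated data; every name and number in it is illustrative and not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 14, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([0.5, 1.0, 2.0]) + rng.normal(size=n)    # hypothetical data

XtX = X.T @ X
beta = np.linalg.solve(XtX, X.T @ y)
resid = y - X @ beta
h = np.diag(X @ np.linalg.solve(XtX, X.T))                # leverages h_ii
s2 = resid @ resid / (n - k - 1)

# Convenient form (9.37): studentized residuals and leverages only.
r = resid / np.sqrt(s2 * (1 - h))
D_short = r ** 2 / (k + 1) * h / (1 - h)

# Definition (9.35): refit with each observation deleted in turn.
D_def = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b_i = np.linalg.solve(X[keep].T @ X[keep], X[keep].T @ y[keep])
    d = b_i - beta
    D_def[i] = d @ XtX @ d / ((k + 1) * s2)

print(np.allclose(D_short, D_def))
```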


Example 9.4. We illustrate several diagnostic tools for the chemical reaction data of
Table 7.4 using $y_1$. In Table 9.1, we give $\hat{\varepsilon}_i$, $h_{ii}$, and some functions of these from
Sections 9.3 and 9.4.

The guideline for $h_{ii}$ in Section 9.4 is $2(k + 1)/n = 2(4)/19 = .421$. The only
value of $h_{ii}$ that exceeds .421 is the first, $h_{11} = .430$. Thus the first observation has
potential for influencing the model fit, but this influence does not appear in
$t_1 = -.383$ and $D_1 = .029$. Other relatively large values of $h_{ii}$ are seen for observations
2, 11, 14, 15, 16, and 17. Of these only observation 14 has a very large (absolute)
value of $t_i$. Observation 12 has large values of $\hat{\varepsilon}_i$, $r_i$, $t_i$, and $D_i$ and is a potentially
influential outlier.

The value of PRESS as defined in (9.33) is PRESS = 130.76, which can be 
compared to SSE = 80.17. 
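The summary values quoted in this example can be recovered, up to rounding, from the $\hat{\varepsilon}_i$ and $h_{ii}$ columns of Table 9.1 alone. The sketch below uses only the Python standard library; the list names are illustrative, and the inputs are the three-decimal table entries, so small rounding differences from the printed values are to be expected.

```python
import math

# Columns of Table 9.1 (residuals and leverages), in observation order.
resid = [-0.688, 2.798, -0.042, 0.670, 0.495, 2.307, -0.082, -1.270, -0.917,
         1.642, 1.923, 4.906, -1.040, -3.605, -1.929, -1.342, -0.242, -3.642,
         0.058]
h = [0.430, 0.310, 0.155, 0.139, 0.129, 0.140, 0.228, 0.186, 0.053, 0.233,
     0.240, 0.164, 0.146, 0.245, 0.250, 0.258, 0.258, 0.217, 0.217]
n, k = 19, 3

sse = sum(e ** 2 for e in resid)
press = sum((e / (1 - hi)) ** 2 for e, hi in zip(resid, h))      # (9.33)
s2 = sse / (n - k - 1)

cutoff = 2 * (k + 1) / n
high_leverage = [i + 1 for i, hi in enumerate(h) if hi > cutoff]  # 1-based

# Diagnostics for observation 12 via (9.25), (9.32), (9.26), and (9.37).
e12, h12 = resid[11], h[11]
r12 = e12 / math.sqrt(s2 * (1 - h12))
s2_del = (sse - e12 ** 2 / (1 - h12)) / (n - k - 2)
t12 = e12 / math.sqrt(s2_del * (1 - h12))
D12 = r12 ** 2 / (k + 1) * h12 / (1 - h12)

print(round(sse, 2), round(press, 2), high_leverage)
print(round(r12, 3), round(t12, 3), round(D12, 3))
```

The recomputed SSE, PRESS, $r_{12}$, $t_{12}$, and $D_{12}$ agree with the values reported in the example and in Table 9.1 to within rounding of the inputs.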


PROBLEMS 


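9.1 Verify the following properties of the residual vector $\hat{\boldsymbol{\varepsilon}}$ as given in (9.7)-(9.14):

(a) $E(\hat{\boldsymbol{\varepsilon}}) = \mathbf{0}$
(b) $\mathrm{cov}(\hat{\boldsymbol{\varepsilon}}) = \sigma^2(\mathbf{I} - \mathbf{H})$
(c) $\mathrm{cov}(\hat{\boldsymbol{\varepsilon}}, \mathbf{y}) = \sigma^2(\mathbf{I} - \mathbf{H})$
(d) $\mathrm{cov}(\hat{\boldsymbol{\varepsilon}}, \hat{\mathbf{y}}) = \mathbf{O}$
(e) $\bar{\hat{\varepsilon}} = \sum_{i=1}^n \hat{\varepsilon}_i/n = 0$
(f) $\hat{\boldsymbol{\varepsilon}}'\mathbf{y} = \mathbf{y}'(\mathbf{I} - \mathbf{H})\mathbf{y}$
(g) $\hat{\boldsymbol{\varepsilon}}'\hat{\mathbf{y}} = 0$
(h) $\hat{\boldsymbol{\varepsilon}}'\mathbf{X} = \mathbf{0}'$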


9.2 (a) In the proof of Theorem 9.2(ii), verify that the maximum value of $h_{ii} - h_{ii}^2$
is $\frac{1}{4}$.

(b) Prove Theorem 9.2(iii). 
(c) Prove Theorem 9.2(iv). 


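9.3 Show that an alternative expression for $h_{ii}$ in Theorem 9.2(iii) is the following:

$$h_{ii} = \frac{1}{n} + (\mathbf{x}_{1i} - \bar{\mathbf{x}}_1)'(\mathbf{x}_{1i} - \bar{\mathbf{x}}_1)\sum_{r=1}^k \frac{1}{\lambda_r}\cos^2\theta_{ir},$$

where $\theta_{ir}$ is the angle between $\mathbf{x}_{1i} - \bar{\mathbf{x}}_1$ and $\mathbf{a}_r$, the $r$th eigenvector of $\mathbf{X}_c'\mathbf{X}_c$,
with corresponding eigenvalue $\lambda_r$
(Cook and Weisberg 1982, p. 13). Thus $h_{ii}$ is large if $(\mathbf{x}_{1i} - \bar{\mathbf{x}}_1)'(\mathbf{x}_{1i} - \bar{\mathbf{x}}_1)$ is
large or if $\theta_{ir}$ is small for some $r$.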




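9.4 Show that $\frac{1}{n} \le h_{ii} + \hat{\varepsilon}_i^2/\hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}} \le 1$ as in (9.24). The following steps are
suggested:

(a) Let $\mathbf{H}^*$ be the hat matrix corresponding to the augmented matrix $(\mathbf{X}, \mathbf{y})$.
Then

$$\mathbf{H}^* = (\mathbf{X}, \mathbf{y})[(\mathbf{X}, \mathbf{y})'(\mathbf{X}, \mathbf{y})]^{-1}(\mathbf{X}, \mathbf{y})' = (\mathbf{X}, \mathbf{y})\begin{pmatrix} \mathbf{X}'\mathbf{X} & \mathbf{X}'\mathbf{y} \\ \mathbf{y}'\mathbf{X} & \mathbf{y}'\mathbf{y} \end{pmatrix}^{-1}\begin{pmatrix} \mathbf{X}' \\ \mathbf{y}' \end{pmatrix}.$$

Use the inverse of a partitioned matrix in (2.50) with $\mathbf{A}_{11} = \mathbf{X}'\mathbf{X}$,
$\mathbf{a}_{12} = \mathbf{X}'\mathbf{y}$, and $a_{22} = \mathbf{y}'\mathbf{y}$ to obtain

$$\mathbf{H}^* = \mathbf{H} + \frac{1}{b}\left[\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}\mathbf{y}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' - \mathbf{y}\mathbf{y}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}\mathbf{y}' + \mathbf{y}\mathbf{y}'\right]$$

$$= \mathbf{H} + \frac{1}{b}\left[\mathbf{H}\mathbf{y}\mathbf{y}'\mathbf{H} - \mathbf{y}\mathbf{y}'\mathbf{H} - \mathbf{H}\mathbf{y}\mathbf{y}' + \mathbf{y}\mathbf{y}'\right],$$

where $b = \mathbf{y}'\mathbf{y} - \mathbf{y}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$.

(b) Show that the above expression factors into

$$\mathbf{H}^* = \mathbf{H} + \frac{(\mathbf{I} - \mathbf{H})\mathbf{y}\mathbf{y}'(\mathbf{I} - \mathbf{H})}{\mathbf{y}'(\mathbf{I} - \mathbf{H})\mathbf{y}} = \mathbf{H} + \frac{\hat{\boldsymbol{\varepsilon}}\hat{\boldsymbol{\varepsilon}}'}{\hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}}},$$

which gives $h_{ii}^* = h_{ii} + \hat{\varepsilon}_i^2/\hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}}$.

(c) The proof is easily completed by noting that $\mathbf{H}^*$ is a hat matrix and therefore
$(1/n) \le h_{ii}^* \le 1$ by Theorem 9.2(i).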


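9.5 Show that $\hat{\boldsymbol{\beta}}_{(i)} = \hat{\boldsymbol{\beta}} - \hat{\varepsilon}_i(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i/(1 - h_{ii})$ as in (9.29). The following steps
are suggested:

(a) Show that $\mathbf{X}'\mathbf{X} = \mathbf{X}_{(i)}'\mathbf{X}_{(i)} + \mathbf{x}_i\mathbf{x}_i'$ and that $\mathbf{X}'\mathbf{y} = \mathbf{X}_{(i)}'\mathbf{y}_{(i)} + \mathbf{x}_iy_i$.

(b) Show that $(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}_{(i)}'\mathbf{y}_{(i)} = \hat{\boldsymbol{\beta}} - (\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_iy_i$.

(c) Using the following adaptation of (2.53)

$$(\mathbf{B} - \mathbf{c}\mathbf{c}')^{-1} = \mathbf{B}^{-1} + \frac{\mathbf{B}^{-1}\mathbf{c}\mathbf{c}'\mathbf{B}^{-1}}{1 - \mathbf{c}'\mathbf{B}^{-1}\mathbf{c}},$$

show that

$$\hat{\boldsymbol{\beta}}_{(i)} = \left[(\mathbf{X}'\mathbf{X})^{-1} + \frac{(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i\mathbf{x}_i'(\mathbf{X}'\mathbf{X})^{-1}}{1 - h_{ii}}\right]\mathbf{X}_{(i)}'\mathbf{y}_{(i)}.$$

(d) Using the result of parts (b) and (c), show that

$$\hat{\boldsymbol{\beta}}_{(i)} = \hat{\boldsymbol{\beta}} - \frac{\hat{\varepsilon}_i}{1 - h_{ii}}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i.$$


9.6 Show that $\hat{\varepsilon}_{(i)} = \hat{\varepsilon}_i/(1 - h_{ii})$ as in (9.30).


9.7 Show that $t_i = \hat{\varepsilon}_{(i)}/\sqrt{\widehat{\mathrm{var}}(\hat{\varepsilon}_{(i)})}$ in (9.31) is the same as $t_i = \hat{\varepsilon}_i/(s_{(i)}\sqrt{1 - h_{ii}})$ in
(9.26). The following steps are suggested:

(a) Using $\hat{\varepsilon}_{(i)} = \hat{\varepsilon}_i/(1 - h_{ii})$ in (9.30), show that $\mathrm{var}(\hat{\varepsilon}_{(i)}) = \sigma^2/(1 - h_{ii})$.

(b) If $\mathrm{var}(\hat{\varepsilon}_{(i)})$ in part (a) is estimated by $\widehat{\mathrm{var}}(\hat{\varepsilon}_{(i)}) = s_{(i)}^2/(1 - h_{ii})$, show that
$\hat{\varepsilon}_{(i)}/\sqrt{\widehat{\mathrm{var}}(\hat{\varepsilon}_{(i)})} = \hat{\varepsilon}_i/(s_{(i)}\sqrt{1 - h_{ii}})$.


9.8 Show that $\mathrm{SSE}_{(i)} = \mathbf{y}_{(i)}'\mathbf{y}_{(i)} - \mathbf{y}_{(i)}'\mathbf{X}_{(i)}\hat{\boldsymbol{\beta}}_{(i)}$ can be written in the form

$$\mathrm{SSE}_{(i)} = \mathrm{SSE} - \hat{\varepsilon}_i^2/(1 - h_{ii})$$

as in (9.32). One way to do this is as follows:

(a) Show that $\mathbf{y}_{(i)}'\mathbf{y}_{(i)} = \mathbf{y}'\mathbf{y} - y_i^2$.

(b) Using Problem 9.5a,d, we have

$$\mathbf{y}_{(i)}'\mathbf{X}_{(i)}\hat{\boldsymbol{\beta}}_{(i)} = (\mathbf{y}'\mathbf{X} - y_i\mathbf{x}_i')\left[\hat{\boldsymbol{\beta}} - \frac{\hat{\varepsilon}_i}{1 - h_{ii}}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i\right].$$

Show that this can be written as

$$\mathbf{y}_{(i)}'\mathbf{X}_{(i)}\hat{\boldsymbol{\beta}}_{(i)} = \mathbf{y}'\mathbf{X}\hat{\boldsymbol{\beta}} - y_i^2 + \frac{\hat{\varepsilon}_i^2}{1 - h_{ii}}.$$

(c) Show that

$$\mathrm{SSE}_{(i)} = \mathrm{SSE} - \hat{\varepsilon}_i^2/(1 - h_{ii}).$$


9.9 Show that $D_i = r_i^2h_{ii}/[(k + 1)(1 - h_{ii})]$ in (9.37) is the same as $D_i$ in (9.35).
This may be done by substituting (9.29) into (9.35).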


9.10 For the gas vapor data in Table 7.3, compute the diagnostic measures
$\hat{y}_i$, $\hat{\varepsilon}_i$, $h_{ii}$, $r_i$, $t_i$, and $D_i$. Display these in a table similar to Table 9.1. Are
there outliers or potentially influential observations? Calculate PRESS and
compare to SSE.


9.11 For the land rent data in Table 7.5, compute the diagnostic measures
$\hat{y}_i$, $\hat{\varepsilon}_i$, $h_{ii}$, $r_i$, $t_i$, and $D_i$. Display these in a table similar to Table 9.1. Are
there outliers or potentially influential observations? Calculate PRESS and
compare to SSE.


9.12 For the chemical reaction data of Table 7.4 with dependent variable $y_2$,
compute the diagnostic measures $\hat{y}_i$, $\hat{\varepsilon}_i$, $h_{ii}$, $r_i$, $t_i$, and $D_i$. Display these in a
table similar to Table 9.1. Are there outliers or potentially influential
observations? Calculate PRESS and compare to SSE.


10 Multiple Regression: Random x's


Throughout Chapters 7—9 we assumed that the x variables were fixed; that is, that 
they remain constant in repeated sampling. However, in many regression appli- 
cations, they are random variables. In this chapter we obtain estimators and test stat- 
istics for a regression model with random x variables. Many of these estimators and 
test statistics are the same as those for fixed x’s, but their properties are somewhat 
different. 

In the random-x case, $k + 1$ variables $y, x_1, x_2, \ldots, x_k$ are measured on each of
the $n$ subjects or experimental units in the sample. These $n$ observation vectors
yield the data

$$\begin{array}{ccccc}
y_1 & x_{11} & x_{12} & \cdots & x_{1k} \\
y_2 & x_{21} & x_{22} & \cdots & x_{2k} \\
\vdots & \vdots & \vdots & & \vdots \\
y_n & x_{n1} & x_{n2} & \cdots & x_{nk}.
\end{array} \tag{10.1}$$

The rows of this array are random vectors of the second type described in Section 3.1.
The variables $y, x_1, x_2, \ldots, x_k$ in a row are typically correlated and have different
variances; that is, for the random vector $(y, x_1, \ldots, x_k) = (y, \mathbf{x}')$, we have

$$\mathrm{cov}\begin{pmatrix} y \\ x_1 \\ \vdots \\ x_k \end{pmatrix} = \mathrm{cov}\begin{pmatrix} y \\ \mathbf{x} \end{pmatrix} = \boldsymbol{\Sigma},$$

where $\boldsymbol{\Sigma}$ is not a diagonal matrix. The vectors themselves [rows of the array in (10.1)]
are ordinarily mutually independent (uncorrelated) if they arise from a random
sample.


In Sections 10.1—10.5 we assume that y and the x variables have a multivariate 


normal distribution. Many of the results in Sections 10.6—10.8 do not require a 
normality assumption. 


10.1 MULTIVARIATE NORMAL REGRESSION MODEL 


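The estimation and testing results in Sections 10.1-10.5 are based on the assumption
that $(y, x_1, \ldots, x_k) = (y, \mathbf{x}')$ is distributed as $N_{k+1}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ with

$$\boldsymbol{\mu} = \begin{pmatrix} \mu_y \\ \mu_1 \\ \vdots \\ \mu_k \end{pmatrix} = \begin{pmatrix} \mu_y \\ \boldsymbol{\mu}_x \end{pmatrix}, \tag{10.2}$$

$$\boldsymbol{\Sigma} = \begin{pmatrix}
\sigma_{yy} & \sigma_{y1} & \cdots & \sigma_{yk} \\
\sigma_{1y} & \sigma_{11} & \cdots & \sigma_{1k} \\
\vdots & \vdots & & \vdots \\
\sigma_{ky} & \sigma_{k1} & \cdots & \sigma_{kk}
\end{pmatrix} = \begin{pmatrix} \sigma_{yy} & \boldsymbol{\sigma}_{yx}' \\ \boldsymbol{\sigma}_{yx} & \boldsymbol{\Sigma}_{xx} \end{pmatrix}, \tag{10.3}$$

where $\boldsymbol{\mu}_x$ is the mean vector for the $x$'s, $\boldsymbol{\sigma}_{yx}$ is the vector of covariances between $y$ and
the $x$'s, and $\boldsymbol{\Sigma}_{xx}$ is the covariance matrix for the $x$'s.

From Corollary 1 to Theorem 4.4d, we have

$$E(y|\mathbf{x}) = \mu_y + \boldsymbol{\sigma}_{yx}'\boldsymbol{\Sigma}_{xx}^{-1}(\mathbf{x} - \boldsymbol{\mu}_x) \tag{10.4}$$

$$= \beta_0 + \boldsymbol{\beta}_1'\mathbf{x}, \tag{10.5}$$

where

$$\beta_0 = \mu_y - \boldsymbol{\sigma}_{yx}'\boldsymbol{\Sigma}_{xx}^{-1}\boldsymbol{\mu}_x, \tag{10.6}$$

$$\boldsymbol{\beta}_1 = \boldsymbol{\Sigma}_{xx}^{-1}\boldsymbol{\sigma}_{yx}. \tag{10.7}$$

From Corollary 1 to Theorem 4.4d, we also obtain

$$\mathrm{var}(y|\mathbf{x}) = \sigma_{yy} - \boldsymbol{\sigma}_{yx}'\boldsymbol{\Sigma}_{xx}^{-1}\boldsymbol{\sigma}_{yx} = \sigma^2. \tag{10.8}$$

The mean, $E(y|\mathbf{x}) = \mu_y + \boldsymbol{\sigma}_{yx}'\boldsymbol{\Sigma}_{xx}^{-1}(\mathbf{x} - \boldsymbol{\mu}_x)$, is a linear function of $\mathbf{x}$, but the variance,
$\sigma^2 = \sigma_{yy} - \boldsymbol{\sigma}_{yx}'\boldsymbol{\Sigma}_{xx}^{-1}\boldsymbol{\sigma}_{yx}$, is not a function of $\mathbf{x}$. Thus under the multivariate normal
assumption, (10.4) and (10.8) provide a linear model with constant variance, which is
analogous to the fixed-x case. Note, however, that $E(y|\mathbf{x}) = \beta_0 + \boldsymbol{\beta}_1'\mathbf{x}$ in (10.5) does
not allow for curvature such as $E(y) = \beta_0 + \beta_1x + \beta_2x^2$. Thus $E(y|\mathbf{x}) = \beta_0 + \boldsymbol{\beta}_1'\mathbf{x}$
represents a model that is linear in the $x$'s as well as the $\beta$'s. This differs from the
linear model in the fixed-x case, which requires only linearity in the $\beta$'s.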
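As a quick check on (10.6)-(10.8), it may help to write out the case of a single predictor ($k = 1$), where $\boldsymbol{\Sigma}_{xx}$ reduces to the scalar $\sigma_{xx}$. The correlation symbol $\rho$ below is introduced only for this illustration.

```latex
% Special case k = 1 of (10.6)-(10.8); \rho is introduced here for illustration.
\beta_1 = \frac{\sigma_{yx}}{\sigma_{xx}}, \qquad
\beta_0 = \mu_y - \beta_1 \mu_x, \qquad
\sigma^2 = \sigma_{yy} - \frac{\sigma_{yx}^2}{\sigma_{xx}}
         = \sigma_{yy}\,(1 - \rho^2),
\quad \text{where } \rho = \frac{\sigma_{yx}}{\sqrt{\sigma_{yy}\,\sigma_{xx}}}.
```

In this form it is plain that the conditional variance does not involve $x$ and shrinks toward zero as $|\rho|$ approaches 1.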


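10.2 ESTIMATION AND TESTING IN MULTIVARIATE
NORMAL REGRESSION


Before obtaining estimators of $\beta_0$, $\boldsymbol{\beta}_1$, and $\sigma^2$ in (10.6)-(10.8), we must first estimate
$\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$. Maximum likelihood estimators of $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ are given in the following
theorem.


Theorem 10.2a. If $(y_1, \mathbf{x}_1'), (y_2, \mathbf{x}_2'), \ldots, (y_n, \mathbf{x}_n')$ [rows of the array in (10.1)] is a
random sample from $N_{k+1}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, with $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ as given in (10.2) and (10.3), the
maximum likelihood estimators are

$$\hat{\boldsymbol{\mu}} = \begin{pmatrix} \hat{\mu}_y \\ \hat{\boldsymbol{\mu}}_x \end{pmatrix} = \begin{pmatrix} \bar{y} \\ \bar{\mathbf{x}} \end{pmatrix}, \tag{10.9}$$

$$\hat{\boldsymbol{\Sigma}} = \frac{n - 1}{n}\mathbf{S} = \frac{n - 1}{n}\begin{pmatrix} s_{yy} & \mathbf{s}_{yx}' \\ \mathbf{s}_{yx} & \mathbf{S}_{xx} \end{pmatrix}, \tag{10.10}$$

where the partitioning of $\hat{\boldsymbol{\mu}}$ and $\mathbf{S}$ is analogous to the partitioning of $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ in (10.2)
and (10.3). The elements of the sample covariance matrix $\mathbf{S}$ are defined in (7.40) and
in (10.14).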


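PROOF. Denote $(y_i, \mathbf{x}_i')'$ by $\mathbf{v}_i$, $i = 1, 2, \ldots, n$. As noted below (10.1), $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_n$
are independent because they arise from a random sample. The likelihood function
(joint density) is therefore given by the product

$$L(\boldsymbol{\mu}, \boldsymbol{\Sigma}) = \prod_{i=1}^n f(\mathbf{v}_i; \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \prod_{i=1}^n \frac{1}{(\sqrt{2\pi})^{k+1}|\boldsymbol{\Sigma}|^{1/2}}e^{-(\mathbf{v}_i - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{v}_i - \boldsymbol{\mu})/2}$$

$$= \frac{1}{(\sqrt{2\pi})^{n(k+1)}|\boldsymbol{\Sigma}|^{n/2}}e^{-\sum_{i=1}^n (\mathbf{v}_i - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{v}_i - \boldsymbol{\mu})/2}. \tag{10.11}$$

Note that $L(\boldsymbol{\mu}, \boldsymbol{\Sigma}) = \prod_{i=1}^n f(\mathbf{v}_i; \boldsymbol{\mu}, \boldsymbol{\Sigma})$ is a product of $n$ multivariate normal densities,
each involving $k + 1$ random variables. Thus there are $n(k + 1)$ random variables as
compared to the likelihood $L(\boldsymbol{\beta}, \sigma^2)$ in (7.50) that involves $n$ random variables
$y_1, y_2, \ldots, y_n$ [the $x$'s are fixed in (7.50)].

To find the maximum likelihood estimator for $\boldsymbol{\mu}$, we expand and sum the exponent
in (10.11) and then take the logarithm to obtain

$$\ln L(\boldsymbol{\mu}, \boldsymbol{\Sigma}) = -n(k + 1)\ln\sqrt{2\pi} - \frac{n}{2}\ln|\boldsymbol{\Sigma}| - \frac{1}{2}\sum_i \mathbf{v}_i'\boldsymbol{\Sigma}^{-1}\mathbf{v}_i + \boldsymbol{\mu}'\boldsymbol{\Sigma}^{-1}\sum_i \mathbf{v}_i - \frac{n}{2}\boldsymbol{\mu}'\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}. \tag{10.12}$$

Differentiating (10.12) with respect to $\boldsymbol{\mu}$ using (2.112) and (2.113) and setting the
result equal to $\mathbf{0}$, we obtain

$$\frac{\partial \ln L(\boldsymbol{\mu}, \boldsymbol{\Sigma})}{\partial \boldsymbol{\mu}} = \mathbf{0} - \mathbf{0} - \mathbf{0} + \boldsymbol{\Sigma}^{-1}\sum_i \mathbf{v}_i - \frac{2n}{2}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu} = \mathbf{0},$$

which gives

$$\hat{\boldsymbol{\mu}} = \frac{1}{n}\sum_{i=1}^n \mathbf{v}_i = \bar{\mathbf{v}} = \begin{pmatrix} \bar{y} \\ \bar{\mathbf{x}} \end{pmatrix},$$

where $\bar{\mathbf{x}} = (\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_k)'$ is the vector of sample means of the $x$'s. To find the
maximum likelihood estimator of $\boldsymbol{\Sigma}$, we rewrite the exponent of (10.11) and then
take the logarithm to obtain

$$\ln L(\boldsymbol{\mu}, \boldsymbol{\Sigma}^{-1}) = -n(k + 1)\ln\sqrt{2\pi} + \frac{n}{2}\ln|\boldsymbol{\Sigma}^{-1}| - \frac{1}{2}\sum_i (\mathbf{v}_i - \bar{\mathbf{v}})'\boldsymbol{\Sigma}^{-1}(\mathbf{v}_i - \bar{\mathbf{v}}) - \frac{n}{2}(\bar{\mathbf{v}} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\bar{\mathbf{v}} - \boldsymbol{\mu})$$

$$= -n(k + 1)\ln\sqrt{2\pi} + \frac{n}{2}\ln|\boldsymbol{\Sigma}^{-1}| - \frac{1}{2}\mathrm{tr}\left[\boldsymbol{\Sigma}^{-1}\sum_i (\mathbf{v}_i - \bar{\mathbf{v}})(\mathbf{v}_i - \bar{\mathbf{v}})'\right] - \frac{n}{2}\mathrm{tr}\left[\boldsymbol{\Sigma}^{-1}(\bar{\mathbf{v}} - \boldsymbol{\mu})(\bar{\mathbf{v}} - \boldsymbol{\mu})'\right].$$

Differentiating this with respect to $\boldsymbol{\Sigma}^{-1}$ using (2.115) and (2.116), and setting the
result equal to $\mathbf{O}$, we obtain

$$\frac{\partial \ln L(\boldsymbol{\mu}, \boldsymbol{\Sigma}^{-1})}{\partial \boldsymbol{\Sigma}^{-1}} = n\boldsymbol{\Sigma} - \frac{n}{2}\mathrm{diag}(\boldsymbol{\Sigma}) - \sum_i (\mathbf{v}_i - \bar{\mathbf{v}})(\mathbf{v}_i - \bar{\mathbf{v}})' + \frac{1}{2}\mathrm{diag}\left[\sum_i (\mathbf{v}_i - \bar{\mathbf{v}})(\mathbf{v}_i - \bar{\mathbf{v}})'\right]$$

$$- n(\bar{\mathbf{v}} - \boldsymbol{\mu})(\bar{\mathbf{v}} - \boldsymbol{\mu})' + \frac{n}{2}\mathrm{diag}[(\bar{\mathbf{v}} - \boldsymbol{\mu})(\bar{\mathbf{v}} - \boldsymbol{\mu})'] = \mathbf{O}.$$

Since $\hat{\boldsymbol{\mu}} = \bar{\mathbf{v}}$, the last two terms disappear and we obtain

$$\hat{\boldsymbol{\Sigma}} = \frac{1}{n}\sum_{i=1}^n (\mathbf{v}_i - \bar{\mathbf{v}})(\mathbf{v}_i - \bar{\mathbf{v}})' = \frac{n - 1}{n}\mathbf{S}. \tag{10.13}$$

See Problem 10.1 for verification that $\sum_{i=1}^n (\mathbf{v}_i - \bar{\mathbf{v}})(\mathbf{v}_i - \bar{\mathbf{v}})' = (n - 1)\mathbf{S}$.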
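A numerical illustration of Theorem 10.2a: with $\boldsymbol{\mu}$ fixed at $\bar{\mathbf{v}}$, the log likelihood evaluated at $\hat{\boldsymbol{\Sigma}} = (n - 1)\mathbf{S}/n$ should be at least as large as at the unbiased estimator $\mathbf{S}$. The sketch below assumes NumPy; the population parameters, sample size, and the helper name `log_likelihood` are hypothetical choices for this example.

```python
import numpy as np

def log_likelihood(V, mu, Sigma):
    """Log of the multivariate normal likelihood (10.11) for rows of V."""
    n, p = V.shape
    dev = V - mu
    _, logdet = np.linalg.slogdet(Sigma)
    quad = np.einsum("ij,jk,ik->", dev, np.linalg.inv(Sigma), dev)
    return -0.5 * n * p * np.log(2 * np.pi) - 0.5 * n * logdet - 0.5 * quad

rng = np.random.default_rng(4)
n = 30
mu_true = [1.0, 0.0, 2.0]                      # hypothetical parameters
Sigma_true = [[2.0, 0.5, 0.3],
              [0.5, 1.0, 0.2],
              [0.3, 0.2, 1.5]]
V = rng.multivariate_normal(mu_true, Sigma_true, size=n)   # rows are v_i'

mu_hat = V.mean(axis=0)                        # (10.9)
S = np.cov(V, rowvar=False)                    # divisor n - 1
Sigma_hat = (n - 1) / n * S                    # (10.10), (10.13)

ll_mle = log_likelihood(V, mu_hat, Sigma_hat)
ll_S = log_likelihood(V, mu_hat, S)
print(ll_mle >= ll_S)
```

Here `np.cov(..., bias=True)` gives the divisor-$n$ estimator directly, which is another way to obtain `Sigma_hat`.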


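In partitioned form, the sample covariance matrix $\mathbf{S}$ can be written as in (10.10)

$$\mathbf{S} = \begin{pmatrix} s_{yy} & \mathbf{s}_{yx}' \\ \mathbf{s}_{yx} & \mathbf{S}_{xx} \end{pmatrix} = \begin{pmatrix}
s_{yy} & s_{y1} & \cdots & s_{yk} \\
s_{1y} & s_{11} & \cdots & s_{1k} \\
\vdots & \vdots & & \vdots \\
s_{ky} & s_{k1} & \cdots & s_{kk}
\end{pmatrix}, \tag{10.14}$$

where $\mathbf{s}_{yx}$ is the vector of sample covariances between $y$ and the $x$'s and $\mathbf{S}_{xx}$ is the
sample covariance matrix for the $x$'s. For example

$$s_{y1} = \frac{\sum_{i=1}^n (y_i - \bar{y})(x_{i1} - \bar{x}_1)}{n - 1},$$

$$s_{11} = \frac{\sum_{i=1}^n (x_{i1} - \bar{x}_1)^2}{n - 1},$$

$$s_{12} = \frac{\sum_{i=1}^n (x_{i1} - \bar{x}_1)(x_{i2} - \bar{x}_2)}{n - 1}$$

[see (7.41)-(7.43)]. By (5.7), $E(s_{yy}) = \sigma_{yy}$ and $E(s_{jj}) = \sigma_{jj}$. By (5.17), $E(s_{yj}) = \sigma_{yj}$
and $E(s_{ij}) = \sigma_{ij}$. Thus $E(\mathbf{S}) = \boldsymbol{\Sigma}$, where $\boldsymbol{\Sigma}$ is given in (10.3). The maximum likelihood
estimator $\hat{\boldsymbol{\Sigma}} = (n - 1)\mathbf{S}/n$ is therefore biased.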

In order to find maximum likelihood estimators of $\beta_0$, $\boldsymbol{\beta}_1$, and $\sigma^2$ we first note the
invariance property of maximum likelihood estimators.


Theorem 10.2b. The maximum likelihood estimator of a function of one or more
parameters is the same function of the corresponding estimators; that is, if $\hat{\boldsymbol{\theta}}$ is the
maximum likelihood estimator of the vector or matrix of parameters $\boldsymbol{\theta}$, then $g(\hat{\boldsymbol{\theta}})$ is
the maximum likelihood estimator of $g(\boldsymbol{\theta})$.


PROOF. See Hogg and Craig (1995, p. 265).


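Example 10.2. We illustrate the use of the invariance property in Theorem 10.2b by
showing that the sample correlation matrix $\mathbf{R}$ is the maximum likelihood estimator of
the population correlation matrix $\mathbf{P}_\rho$ when sampling from the multivariate normal
distribution. By (3.30), the relationship between $\mathbf{P}_\rho$ and $\boldsymbol{\Sigma}$ is given by
$\mathbf{P}_\rho = \mathbf{D}_\sigma^{-1}\boldsymbol{\Sigma}\mathbf{D}_\sigma^{-1}$, where $\mathbf{D}_\sigma = [\mathrm{diag}(\boldsymbol{\Sigma})]^{1/2}$, so that

$$\mathbf{D}_\sigma^{-1} = \mathrm{diag}\left(\frac{1}{\sqrt{\sigma_{11}}}, \frac{1}{\sqrt{\sigma_{22}}}, \ldots, \frac{1}{\sqrt{\sigma_{pp}}}\right).$$

The maximum likelihood estimator of $1/\sqrt{\sigma_{jj}}$ is $1/\sqrt{\hat{\sigma}_{jj}}$, where
$\hat{\sigma}_{jj} = (1/n)\sum_i (y_{ij} - \bar{y}_j)^2$. Thus $\hat{\mathbf{D}}_\sigma^{-1} = \mathrm{diag}(1/\sqrt{\hat{\sigma}_{11}}, 1/\sqrt{\hat{\sigma}_{22}}, \ldots, 1/\sqrt{\hat{\sigma}_{pp}})$, and
we obtain

$$\hat{\mathbf{P}}_\rho = \hat{\mathbf{D}}_\sigma^{-1}\hat{\boldsymbol{\Sigma}}\hat{\mathbf{D}}_\sigma^{-1} = \left(\frac{\hat{\sigma}_{jk}}{\sqrt{\hat{\sigma}_{jj}}\sqrt{\hat{\sigma}_{kk}}}\right)$$

$$= \left(\frac{\sum_i (y_{ij} - \bar{y}_j)(y_{ik} - \bar{y}_k)/n}{\sqrt{\sum_i (y_{ij} - \bar{y}_j)^2/n}\sqrt{\sum_i (y_{ik} - \bar{y}_k)^2/n}}\right)$$

$$= \left(\frac{\sum_i (y_{ij} - \bar{y}_j)(y_{ik} - \bar{y}_k)}{\sqrt{\sum_i (y_{ij} - \bar{y}_j)^2\sum_i (y_{ik} - \bar{y}_k)^2}}\right)$$

$$= (r_{jk}) = \mathbf{R}.$$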
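The cancellation of the divisor $n$ in this example can be seen numerically: rescaling the divisor-$n$ covariance matrix by its own standard deviations reproduces the usual sample correlation matrix. A short sketch, assuming NumPy and simulated data (the mixing matrix and seed are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
mix = np.array([[1.0, 0.4, 0.0],
                [0.0, 1.0, 0.3],
                [0.0, 0.0, 1.0]])
Y = rng.normal(size=(25, 3)) @ mix             # hypothetical correlated sample

Sigma_hat = np.cov(Y, rowvar=False, bias=True)           # divisor n (the MLE)
D_inv = np.diag(1.0 / np.sqrt(np.diag(Sigma_hat)))
P_hat = D_inv @ Sigma_hat @ D_inv                        # D^{-1} Sigma_hat D^{-1}

R = np.corrcoef(Y, rowvar=False)                         # sample correlation matrix
print(np.allclose(P_hat, R))
```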


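Maximum likelihood estimators of $\beta_0$, $\boldsymbol{\beta}_1$, and $\sigma^2$ are now given in the following
theorem.


Theorem 10.2c. If $(y_1, \mathbf{x}_1'), (y_2, \mathbf{x}_2'), \ldots, (y_n, \mathbf{x}_n')$ is a random sample from
$N_{k+1}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, where $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ are given by (10.2) and (10.3), the maximum likelihood
estimators for $\beta_0$, $\boldsymbol{\beta}_1$, and $\sigma^2$ in (10.6)-(10.8) are as follows:

$$\hat{\beta}_0 = \bar{y} - \mathbf{s}_{yx}'\mathbf{S}_{xx}^{-1}\bar{\mathbf{x}}, \tag{10.15}$$

$$\hat{\boldsymbol{\beta}}_1 = \mathbf{S}_{xx}^{-1}\mathbf{s}_{yx}, \tag{10.16}$$

$$\hat{\sigma}^2 = \frac{n - 1}{n}s^2, \quad \text{where } s^2 = s_{yy} - \mathbf{s}_{yx}'\mathbf{S}_{xx}^{-1}\mathbf{s}_{yx}. \tag{10.17}$$

The estimator $s^2$ is a bias-corrected estimator of $\sigma^2$.


PROOF. By the invariance property of maximum likelihood estimators (Theorem
10.2b), we insert (10.9) and (10.10) into (10.6), (10.7), and (10.8) to obtain the
desired results (using the unbiased estimator $\mathbf{S}$ in place of $\hat{\boldsymbol{\Sigma}}$).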




The estimators $\hat{\beta}_0$, $\hat{\boldsymbol{\beta}}_1$, and $s^2$ have a minimum variance property analogous to that
of the corresponding estimators for the case of normal $y$'s and fixed $x$'s in Theorem
7.6d. It can be shown that $\hat{\boldsymbol{\mu}}$ and $\mathbf{S}$ in (10.9) and (10.10) are jointly sufficient for $\boldsymbol{\mu}$
and $\boldsymbol{\Sigma}$ (see Problem 10.2). Then, with some additional properties that can be demonstrated,
it follows that $\hat{\beta}_0$, $\hat{\boldsymbol{\beta}}_1$, and $s^2$ are minimum variance unbiased estimators for
$\beta_0$, $\boldsymbol{\beta}_1$, and $\sigma^2$ (Graybill 1976, p. 380).

The maximum likelihood estimators $\hat{\beta}_0$ and $\hat{\boldsymbol{\beta}}_1$ in (10.15) and (10.16) are the same
algebraic functions of the observations as the least-squares estimators given in (7.47)
and (7.46) for the fixed-x case. The estimators in (10.15) and (10.16) are also identical
to the maximum likelihood estimators for normal $y$'s and fixed $x$'s in Section 7.6.2
(see Problem 7.17). However, even though the estimators in the random-x case and
fixed-x case are the same, their distributions differ. When $y$ and the $x$'s are multivariate
normal, $\hat{\boldsymbol{\beta}}_1$ does not have a multivariate normal distribution as it does in
the fixed-x case with normal $y$'s [Theorem 7.6b(i)]. For large $n$, the distribution is
similar to the multivariate normal, but for small $n$, the distribution has heavier tails
than the multivariate normal.

In spite of the nonnormality of $\hat{\boldsymbol{\beta}}_1$ in the random-x model, the $F$ tests and $t$ tests
and associated confidence regions and intervals of Chapter 8 (fixed-x model) are
still appropriate. To see this, note that since the conditional distribution of $y$ for
a given value of $\mathbf{x}$ is normal (Corollary 1 to Theorem 4.4d), the conditional
distribution of the vector of observations $\mathbf{y} = (y_1, y_2, \ldots, y_n)'$ for a given value
of the $\mathbf{X}$ matrix is multivariate normal. Therefore, a test statistic such as (8.35)
is distributed conditionally as an $F$ for the given value of $\mathbf{X}$ when $H_0$ is true.
However, the central $F$ distribution depends only on degrees of freedom; it does
not depend on $\mathbf{X}$. Thus under $H_0$, the statistic has (unconditionally) an $F$ distribution
for all values of $\mathbf{X}$, and so tests can be carried out exactly as in the
fixed-x case.

The main difference is that when $H_0$ is false, the noncentrality parameter is a function
of $\mathbf{X}$, which is random. Hence the noncentral $F$ distribution does not apply to the
random-x case. This only affects such things as power calculations.

Confidence intervals for the $\beta_j$'s in Section 8.6.2 and for linear functions of
the $\beta_j$'s in Section 8.6.3 are based on the central $t$ distribution [e.g., see (8.48)].
Thus they also remain valid for the random-x case. However, the expected width
of the interval differs in the two cases (random $x$'s and fixed $x$'s) because of randomness
in $\mathbf{X}$.

In Section 10.5, we obtain the $F$ test for $H_0: \boldsymbol{\beta}_1 = \mathbf{0}$ using the likelihood ratio
approach.


10.3. STANDARDIZED REGRESSION COEFFICENTS 


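We now show that the regression coefficient vector $\hat{\boldsymbol{\beta}}_1$ in (10.16) can be expressed in
terms of sample correlations. By analogy to (10.14), the sample correlation matrix
can be written in partitioned form as

$$\mathbf{R} = \begin{pmatrix} 1 & \mathbf{r}_{yx}' \\ \mathbf{r}_{yx} & \mathbf{R}_{xx} \end{pmatrix} = \begin{pmatrix}
1 & r_{y1} & r_{y2} & \cdots & r_{yk} \\
r_{1y} & 1 & r_{12} & \cdots & r_{1k} \\
r_{2y} & r_{21} & 1 & \cdots & r_{2k} \\
\vdots & \vdots & \vdots & & \vdots \\
r_{ky} & r_{k1} & r_{k2} & \cdots & 1
\end{pmatrix}, \tag{10.18}$$

where $\mathbf{r}_{yx}$ is the vector of correlations between $y$ and the $x$'s and $\mathbf{R}_{xx}$ is the correlation
matrix for the $x$'s. For example

$$r_{y2} = \frac{s_{y2}}{\sqrt{s_y^2s_2^2}} = \frac{\sum_{i=1}^n (y_i - \bar{y})(x_{i2} - \bar{x}_2)}{\sqrt{\sum_{i=1}^n (y_i - \bar{y})^2\sum_{i=1}^n (x_{i2} - \bar{x}_2)^2}},$$

$$r_{12} = \frac{s_{12}}{\sqrt{s_1^2s_2^2}} = \frac{\sum_{i=1}^n (x_{i1} - \bar{x}_1)(x_{i2} - \bar{x}_2)}{\sqrt{\sum_{i=1}^n (x_{i1} - \bar{x}_1)^2\sum_{i=1}^n (x_{i2} - \bar{x}_2)^2}}.$$

By analogy to (3.31), $\mathbf{R}$ can be converted to $\mathbf{S}$ by

$$\mathbf{S} = \mathbf{D}\mathbf{R}\mathbf{D},$$

where $\mathbf{D} = [\mathrm{diag}(\mathbf{S})]^{1/2}$, which can be written in partitioned form as

$$\mathbf{D} = \begin{pmatrix}
s_y & 0 & 0 & \cdots & 0 \\
0 & \sqrt{s_{11}} & 0 & \cdots & 0 \\
0 & 0 & \sqrt{s_{22}} & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots \\
0 & 0 & 0 & \cdots & \sqrt{s_{kk}}
\end{pmatrix} = \begin{pmatrix} s_y & \mathbf{0}' \\ \mathbf{0} & \mathbf{D}_x \end{pmatrix}.$$

Using the partitioned form of $\mathbf{S}$ in (10.14), $\mathbf{S} = \mathbf{D}\mathbf{R}\mathbf{D}$ can be written as

$$\mathbf{S} = \begin{pmatrix} s_{yy} & \mathbf{s}_{yx}' \\ \mathbf{s}_{yx} & \mathbf{S}_{xx} \end{pmatrix} = \begin{pmatrix} s_y^2 & s_y\mathbf{r}_{yx}'\mathbf{D}_x \\ s_y\mathbf{D}_x\mathbf{r}_{yx} & \mathbf{D}_x\mathbf{R}_{xx}\mathbf{D}_x \end{pmatrix}, \tag{10.19}$$

so that

$$\mathbf{S}_{xx} = \mathbf{D}_x\mathbf{R}_{xx}\mathbf{D}_x, \tag{10.20}$$

$$\mathbf{s}_{yx} = s_y\mathbf{D}_x\mathbf{r}_{yx}, \tag{10.21}$$

where $\mathbf{D}_x = \mathrm{diag}(s_1, s_2, \ldots, s_k)$ and $s_y = \sqrt{s_y^2} = \sqrt{s_{yy}}$ is the sample standard deviation
of $y$. When (10.20) and (10.21) are substituted into (10.16), we obtain an
expression for $\hat{\boldsymbol{\beta}}_1$ in terms of correlations:

$$\hat{\boldsymbol{\beta}}_1 = s_y\mathbf{D}_x^{-1}\mathbf{R}_{xx}^{-1}\mathbf{r}_{yx}. \tag{10.22}$$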


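The regression coefficients $\hat{\beta}_1, \hat{\beta}_2, \ldots, \hat{\beta}_k$ in $\hat{\boldsymbol{\beta}}_1$ can be standardized so as to show
the effect of standardized $x$ values (sometimes called $z$ scores). We illustrate this for
$k = 2$. The model in centered form [see (7.30) and an expression following (7.38)] is

$$\hat{y}_i = \bar{y} + \hat{\beta}_1(x_{i1} - \bar{x}_1) + \hat{\beta}_2(x_{i2} - \bar{x}_2).$$

This can be expressed in terms of standardized variables as

$$\frac{\hat{y}_i - \bar{y}}{s_y} = \frac{s_1}{s_y}\hat{\beta}_1\left(\frac{x_{i1} - \bar{x}_1}{s_1}\right) + \frac{s_2}{s_y}\hat{\beta}_2\left(\frac{x_{i2} - \bar{x}_2}{s_2}\right), \tag{10.23}$$

where $s_j = \sqrt{s_{jj}}$ is the standard deviation of $x_j$. We thus define the standardized
coefficients as

$$\hat{\beta}_j^* = \frac{s_j}{s_y}\hat{\beta}_j.$$

These coefficients are often referred to as beta weights or beta coefficients. Since they
are used with standardized variables $(x_{ij} - \bar{x}_j)/s_j$ in (10.23), the $\hat{\beta}_j^*$'s can be readily
compared to each other, whereas the $\hat{\beta}_j$'s cannot be so compared. [Division by $s_y$
in (10.23) is customary but not necessary; the relative values of $s_1\hat{\beta}_1$ and $s_2\hat{\beta}_2$ are
the same as those of $s_1\hat{\beta}_1/s_y$ and $s_2\hat{\beta}_2/s_y$.]

The beta weights can be expressed in vector form as

$$\hat{\boldsymbol{\beta}}_1^* = \frac{1}{s_y}\mathbf{D}_x\hat{\boldsymbol{\beta}}_1.$$

Using (10.22), this can be written as

$$\hat{\boldsymbol{\beta}}_1^* = \mathbf{R}_{xx}^{-1}\mathbf{r}_{yx}. \tag{10.24}$$

Note that $\hat{\boldsymbol{\beta}}_1^*$ in (10.24) is not the same as $\hat{\boldsymbol{\beta}}_1^*$ from the reduced model in (8.8). Note
also the analogy of $\hat{\boldsymbol{\beta}}_1^* = \mathbf{R}_{xx}^{-1}\mathbf{r}_{yx}$ in (10.24) to $\hat{\boldsymbol{\beta}}_1 = \mathbf{S}_{xx}^{-1}\mathbf{s}_{yx}$ in (10.16). In effect, $\mathbf{R}_{xx}$
and $\mathbf{r}_{yx}$ are the covariance matrix and covariance vector for standardized variables.
Replacing $\mathbf{S}_{xx}^{-1}$ and $\mathbf{s}_{yx}$ by $\mathbf{R}_{xx}^{-1}$ and $\mathbf{r}_{yx}$ leads to regression coefficients for standardized
variables.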


Example 10.3. The following six hematology variables were measured on 51 
workers (Royston 1983): 


y = lymphocyte count x3 = white blood cell count (x .01) 
x, = hemoglobin concentration x, = neutrophil count 
X2 = packed-cell volume X5 = serum lead concentration 


The data are given in Table 10.1. 
For y, x, S,, and s,,, we have 


y = 22.902, x’ = (15.108, 45.196, 53.824, 25.529, 21.039), 


0.691 1.494 3.255 0.422 —0.268 
1.494 5401 10.155 1.374 1.292 


Sx =| 3.255 10.155 200.668 64.655 4.067 |, 
1.535 
4.880 

Sx = | 106.202 


0.422 1.374 64.655 56.374 0.579 
—0.268 1.292 4.067 0.579 18.078 
3.753 
3.064 


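By (10.15) to (10.17), we obtain

$$\hat{\boldsymbol{\beta}}_1 = \mathbf{S}_{xx}^{-1}\mathbf{s}_{yx} = \begin{pmatrix} -0.491 \\ -0.316 \\ 0.837 \\ -0.882 \\ 0.025 \end{pmatrix},$$

$$\hat{\beta}_0 = \bar{y} - \mathbf{s}_{yx}'\mathbf{S}_{xx}^{-1}\bar{\mathbf{x}} = 22.902 - 1.355 = 21.547,$$

$$s^2 = s_{yy} - \mathbf{s}_{yx}'\mathbf{S}_{xx}^{-1}\mathbf{s}_{yx} = 90.2902 - 83.3542 = 6.9360.$$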
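The three quantities above can be recomputed from the printed summary statistics. The following is a minimal sketch, not the authors' code: the helper names `solve` and `dot` are hypothetical, the solver is plain Gaussian elimination with partial pivoting, and because the inputs are the rounded values shown in the example, the results agree with the printed ones only to about two or three decimals.

```python
# Summary statistics copied from Example 10.3 (rounded as printed).
y_bar = 22.902
x_bar = [15.108, 45.196, 53.824, 25.529, 21.039]
s_yy = 90.2902
s_yx = [1.535, 4.880, 106.202, 3.753, 3.064]
S_xx = [
    [0.691, 1.494, 3.255, 0.422, -0.268],
    [1.494, 5.401, 10.155, 1.374, 1.292],
    [3.255, 10.155, 200.668, 64.655, 4.067],
    [0.422, 1.374, 64.655, 56.374, 0.579],
    [-0.268, 1.292, 4.067, 0.579, 18.078],
]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


beta1 = solve(S_xx, s_yx)            # (10.16): S_xx^{-1} s_yx
beta0 = y_bar - dot(beta1, x_bar)    # (10.15), using the symmetry of S_xx
s2 = s_yy - dot(s_yx, beta1)         # (10.17)
print([round(b, 3) for b in beta1], round(beta0, 3), round(s2, 3))
```

Solving the linear system directly avoids forming the inverse of S_xx explicitly, which is the usual numerical preference.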




TABLE 10.1 Hematology Data 


Observation
Number y x1 x2 x3 x4 x5
1 14 13.4 39 41 25 17 
2 15 14.6 46 50 30 20 
3 19 13.5 42 45 21 18 
4 23 15.0 46 46 16 18 
5 17 14.6 44 51 31 19 
6 20 14.0 44 49 24 19 
7 21 16.4 49 43 17 18 
8 16 14.8 44 44 26 29 
9 27 15.2 46 41 13 27 
10 34 15.5 48 84 42 36 
11 26 15.2 47 56 27 22 
12 28 16.9 50 51 17 23
13 24 14.8 44 47 20 23 
14 26 16.2 45 56 25 19 
15 23 14.7 43 40 13 17 
16 9 14.7 42 34 22 13 
17 18 16.5 45 54 32 17 
18 28 15.4 45 69 36 24 
19 17 15.1 45 46 29 17 
20 14 14.2 46 42 25 28 
21 8 15.9 46 52 34 16 
22 25 16.0 47 47 14 18 
23 37 17.4 50 86 39 17 
24 20 14.3 43 55 31 19 
25 15 14.8 44 42 24 29 
26 9 14.9 43 43 32 17 
27 16 15.5 45 52 30 20 
28 18 14.5 43 39 18 25 
29 17 14.4 45 60 37 23 
30 23 14.6 44 47 21 27 
31 43 15.3 45 79 23 23
32 17 14.9 45 34 15 24 
33 23 15.8 47 60 32 21 
34 31 14.4 44 77 39 23 
35 11 14.7 46 37 23 23 
36 25 14.8 43 52 19 22 
37 30 15.4 45 60 25 18 
38 32 16.2 50 81 38 18 
39 17 15.0 45 49 26 24 
40 22 15.1 47 60 33 16 
41 20 16.0 46 46 22 22 
42 20 15.3 48 55 23 23 


43 20 14.5 41 62 36 21 
44 26 14.2 41 49 20 20 
45 40 15.0 45 72 25 25 
46 22 14.2 46 58 31 22 
47 61 14.9 45 84 17 17 
48 12 16.2 48 31 15 18 
49 20 14.5 45 40 18 20 
50 35 16.4 49 69 22 24 
51 38 14.7 44 78 34 16 
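
The correlations are given by

$$\mathbf{R}_{xx} = \begin{pmatrix} 1.000 & 0.774 & 0.277 & 0.068 & -0.076 \\ 0.774 & 1.000 & 0.308 & 0.079 & 0.131 \\ 0.277 & 0.308 & 1.000 & 0.608 & 0.068 \\ 0.068 & 0.079 & 0.608 & 1.000 & 0.018 \\ -0.076 & 0.131 & 0.068 & 0.018 & 1.000 \end{pmatrix}, \qquad \mathbf{r}_{yx} = \begin{pmatrix} 0.194 \\ 0.221 \\ 0.789 \\ 0.053 \\ 0.076 \end{pmatrix}.$$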


By (10.24), the standardized coefficient vector is given by

$$\hat{\boldsymbol{\beta}}_1^* = \mathbf{R}_{xx}^{-1}\mathbf{r}_{yx} = \begin{pmatrix} -0.043 \\ -0.077 \\ 1.248 \\ -0.697 \\ 0.011 \end{pmatrix}.$$
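The beta weights can also be recovered elementwise from the definition $\hat{\beta}_j^* = (s_j/s_y)\hat{\beta}_j$. The short sketch below does this with the rounded values printed in Example 10.3; the variable names are illustrative only, and agreement with the vector above is to about three decimals.

```python
import math

s_yy = 90.2902
s_xx_diag = [0.691, 5.401, 200.668, 56.374, 18.078]   # diagonal of S_xx
beta1 = [-0.491, -0.316, 0.837, -0.882, 0.025]        # from Example 10.3

s_y = math.sqrt(s_yy)
# beta*_j = (s_j / s_y) * beta_j, with s_j the standard deviation of x_j.
beta_star = [math.sqrt(v) / s_y * b for v, b in zip(s_xx_diag, beta1)]
print([round(b, 3) for b in beta_star])
```

The large raw coefficient of x1 shrinks to a small beta weight because x1 has a small standard deviation, which is the reason the standardized scale is preferred for comparing variables.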


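10.4 R² IN MULTIVARIATE NORMAL REGRESSION

In the case of fixed x's, we defined $R^2$ as the proportion of variation in y due to regression [see (7.55)]. In the case of random x's, we obtain $R$ as an estimate of a population multiple correlation between y and the x's. Then $R^2$ is the square of this sample multiple correlation.

The population multiple correlation coefficient $\rho_{y|\mathbf{x}}$ is defined as the correlation between y and the linear function $w = \mu_y + \boldsymbol{\sigma}_{yx}'\boldsymbol{\Sigma}_{xx}^{-1}(\mathbf{x} - \boldsymbol{\mu}_x)$:

$$\rho_{y|\mathbf{x}} = \mathrm{corr}(y, w) = \frac{\sigma_{yw}}{\sigma_y\sigma_w}. \tag{10.25}$$

(We use the subscript $y|\mathbf{x}$ to distinguish $\rho_{y|\mathbf{x}}$ from $\rho$, the correlation between y and x in the bivariate normal case; see Sections 3.2, 6.4, and 10.5.) By (10.4), $w$ is equal to $E(y|\mathbf{x})$, which is the population analogue of $\hat{y} = \hat{\beta}_0 + \hat{\boldsymbol{\beta}}_1'\mathbf{x}$, the sample predicted value of y. As $\mathbf{x}$ varies randomly, the population predicted value $w = \mu_y + \boldsymbol{\sigma}_{yx}'\boldsymbol{\Sigma}_{xx}^{-1}(\mathbf{x} - \boldsymbol{\mu}_x)$ becomes a random variable.

It is easily established that $\mathrm{cov}(y, w)$ and $\mathrm{var}(w)$ have the same value:

$$\mathrm{cov}(y, w) = \mathrm{var}(w) = \boldsymbol{\sigma}_{yx}'\boldsymbol{\Sigma}_{xx}^{-1}\boldsymbol{\sigma}_{yx}. \tag{10.26}$$

Then the population multiple correlation $\rho_{y|\mathbf{x}}$ in (10.25) becomes

$$\rho_{y|\mathbf{x}} = \frac{\mathrm{cov}(y, w)}{\sqrt{\mathrm{var}(y)\,\mathrm{var}(w)}} = \sqrt{\frac{\boldsymbol{\sigma}_{yx}'\boldsymbol{\Sigma}_{xx}^{-1}\boldsymbol{\sigma}_{yx}}{\sigma_{yy}}},$$

and the population coefficient of determination or population squared multiple correlation $\rho_{y|\mathbf{x}}^2$ is given by

$$\rho_{y|\mathbf{x}}^2 = \frac{\boldsymbol{\sigma}_{yx}'\boldsymbol{\Sigma}_{xx}^{-1}\boldsymbol{\sigma}_{yx}}{\sigma_{yy}}. \tag{10.27}$$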


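We now list some properties of $\rho_{y|\mathbf{x}}$ and $\rho_{y|\mathbf{x}}^2$:

1. $\rho_{y|\mathbf{x}}$ is the maximum correlation between y and any linear function of $\mathbf{x}$:

$$\rho_{y|\mathbf{x}} = \max_{\mathbf{a}} \rho_{y,\mathbf{a}'\mathbf{x}}. \tag{10.28}$$

This is an alternative definition of $\rho_{y|\mathbf{x}}$ that is not based on the multivariate normal distribution as is the definition in (10.25).

2. $\rho_{y|\mathbf{x}}^2$ can be expressed in terms of determinants:

$$\rho_{y|\mathbf{x}}^2 = 1 - \frac{|\boldsymbol{\Sigma}|}{\sigma_{yy}|\boldsymbol{\Sigma}_{xx}|}, \tag{10.29}$$

where $\boldsymbol{\Sigma}$ and $\boldsymbol{\Sigma}_{xx}$ are defined in (10.3).

3. $\rho_{y|\mathbf{x}}^2$ is invariant to linear transformations on y or on the x's; that is, if $u = ay$ and $\mathbf{v} = \mathbf{B}\mathbf{x}$, where $\mathbf{B}$ is nonsingular, then

$$\rho_{u|\mathbf{v}}^2 = \rho_{y|\mathbf{x}}^2. \tag{10.30}$$

(Note that $\mathbf{v}$ here is not the same as $\mathbf{v}_i$ used in the proof of Theorem 10.2a.)

4. Using $\mathrm{var}(w) = \boldsymbol{\sigma}_{yx}'\boldsymbol{\Sigma}_{xx}^{-1}\boldsymbol{\sigma}_{yx}$ in (10.26), $\rho_{y|\mathbf{x}}^2$ in (10.27) can be written in the form

$$\rho_{y|\mathbf{x}}^2 = \frac{\mathrm{var}(w)}{\mathrm{var}(y)}. \tag{10.31}$$

Since $w = \mu_y + \boldsymbol{\sigma}_{yx}'\boldsymbol{\Sigma}_{xx}^{-1}(\mathbf{x} - \boldsymbol{\mu}_x)$ is the population regression equation, $\rho_{y|\mathbf{x}}^2$ in (10.31) represents the proportion of the variance of y that can be attributed to the regression relationship with the variables in $\mathbf{x}$. In this sense, $\rho_{y|\mathbf{x}}^2$ is analogous to $R^2$ in the fixed-x case in (7.55).

5. By (10.8) and (10.27), $\mathrm{var}(y|\mathbf{x})$ can be expressed in terms of $\rho_{y|\mathbf{x}}^2$:

$$\mathrm{var}(y|\mathbf{x}) = \sigma_{yy} - \boldsymbol{\sigma}_{yx}'\boldsymbol{\Sigma}_{xx}^{-1}\boldsymbol{\sigma}_{yx} = \sigma_{yy} - \sigma_{yy}\rho_{y|\mathbf{x}}^2 = \sigma_{yy}(1 - \rho_{y|\mathbf{x}}^2). \tag{10.32}$$

6. If we consider $y - w$ as a residual or error term, then $y - w$ is uncorrelated with the x's:

$$\mathrm{cov}(y - w, \mathbf{x}) = \mathbf{0}' \tag{10.33}$$

(see Problem 10.8).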
We can obtain a maximum likelihood estimator for $\rho_{y|\mathbf{x}}^2$ by substituting estimators from (10.14) for the parameters in (10.27):

$$R^2 = \frac{\mathbf{s}_{yx}'\mathbf{S}_{xx}^{-1}\mathbf{s}_{yx}}{s_{yy}}. \tag{10.34}$$

We use the notation $R^2$ rather than $\hat{\rho}_{y|\mathbf{x}}^2$ because (10.34) is recognized as having the same form as $R^2$ for the fixed-x case in (7.59). We refer to $R^2$ as the sample coefficient of determination or as the sample squared multiple correlation. The square root of $R^2$,

$$R = \sqrt{\frac{\mathbf{s}_{yx}'\mathbf{S}_{xx}^{-1}\mathbf{s}_{yx}}{s_{yy}}}, \tag{10.35}$$

is the sample multiple correlation coefficient.
We now list several properties of $R$ and $R^2$, some of which are analogous to properties of $\rho_{y|\mathbf{x}}^2$ above.

1. $R$ is equal to the correlation between y and $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1x_1 + \cdots + \hat{\beta}_kx_k = \hat{\beta}_0 + \hat{\boldsymbol{\beta}}_1'\mathbf{x}$:

$$R = r_{y\hat{y}}. \tag{10.36}$$

2. $R$ is equal to the maximum correlation between y and any linear combination of the x's, $\mathbf{a}'\mathbf{x}$:

$$R = \max_{\mathbf{a}} r_{y,\mathbf{a}'\mathbf{x}}. \tag{10.37}$$

3. $R^2$ can be expressed in terms of correlations:

$$R^2 = \mathbf{r}_{yx}'\mathbf{R}_{xx}^{-1}\mathbf{r}_{yx}, \tag{10.38}$$

where $\mathbf{r}_{yx}$ and $\mathbf{R}_{xx}$ are from the sample correlation matrix $\mathbf{R}$ partitioned as in (10.18).

4. $R^2$ can be obtained from $\mathbf{R}^{-1}$:

$$R^2 = 1 - \frac{1}{r^{yy}}, \tag{10.39}$$

where $r^{yy}$ is the first diagonal element of $\mathbf{R}^{-1}$. Using the other diagonal elements of $\mathbf{R}^{-1}$, this relationship can be extended to give the multiple correlation of any $x_j$ with the other x's and y. Thus from $\mathbf{R}^{-1}$ we obtain multiple correlations, as opposed to the simple correlations in $\mathbf{R}$.

5. $R^2$ can be expressed in terms of determinants:

$$R^2 = 1 - \frac{|\mathbf{S}|}{s_{yy}|\mathbf{S}_{xx}|} \tag{10.40}$$

$$\phantom{R^2} = 1 - \frac{|\mathbf{R}|}{|\mathbf{R}_{xx}|}, \tag{10.41}$$

where $\mathbf{S}_{xx}$ and $\mathbf{R}_{xx}$ are defined in (10.14) and (10.18).

6. From (10.24) and (10.38), we can express $R^2$ in terms of beta weights:

$$R^2 = \mathbf{r}_{yx}'\hat{\boldsymbol{\beta}}_1^*, \tag{10.42}$$

where $\hat{\boldsymbol{\beta}}_1^* = \mathbf{R}_{xx}^{-1}\mathbf{r}_{yx}$. This equation does not imply that $R^2$ is the sum of squared partial correlations (Section 10.8).

7. If $\rho_{y|\mathbf{x}}^2 = 0$, the expected value of $R^2$ is given by

$$E(R^2) = \frac{k}{n - 1}. \tag{10.43}$$

Thus $R^2$ is biased when $\rho_{y|\mathbf{x}}^2$ is 0 [this is analogous to (7.57)].

8. $R^2 \ge \max_j r_{yj}^2$, where $r_{yj}$ is an element of $\mathbf{r}_{yx}' = (r_{y1}, r_{y2}, \ldots, r_{yk})$.

9. $R^2$ is invariant to full-rank linear transformations on y or on the x's.
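To give a feel for the size of the bias in property 7, the following worked line evaluates $E(R^2) = k/(n-1)$ for the dimensions of the hematology data ($n = 51$, $k = 5$). It is only an illustration of the formula under the assumption $\rho_{y|\mathbf{x}}^2 = 0$; it makes no claim about the actual population value for those data.

```latex
E(R^2) \;=\; \frac{k}{n-1} \;=\; \frac{5}{51-1} \;=\; 0.10
\qquad \text{when } \rho_{y|\mathbf{x}}^2 = 0 .
```

So even with no population relationship at all, five predictors and 51 observations would be expected to produce a sample $R^2$ of about .10.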


Example 10.4. For the hematology data in Table 10.1, $\mathbf{S}_{xx}$, $\mathbf{s}_{yx}$, $\mathbf{R}_{xx}$, and $\mathbf{r}_{yx}$ were obtained in Example 10.3. Using either (10.34) or (10.38), we obtain

$$R^2 = .9232.$$
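As a quick numerical check, $R^2$ can be formed in two ways from numbers already printed in Example 10.3: the covariance form (10.34), using the quadratic form 83.3542 and $s_{yy} = 90.2902$, and the beta-weight form of property 6, $R^2 = \mathbf{r}_{yx}'\hat{\boldsymbol{\beta}}_1^*$. The sketch below uses the rounded printed values, so the two results agree only to about three decimals.

```python
s_yy = 90.2902
quad_form = 83.3542   # s'_yx S_xx^{-1} s_yx, as printed in Example 10.3

# Covariance form (10.34).
r2_cov = quad_form / s_yy

# Beta-weight form (10.42): sum of beta*_j times r_yj.
beta_star = [-0.043, -0.077, 1.248, -0.697, 0.011]
r_yx = [0.194, 0.221, 0.789, 0.053, 0.076]
r2_beta = sum(b * r for b, r in zip(beta_star, r_yx))

print(round(r2_cov, 4), round(r2_beta, 4))
```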


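10.5 TESTS AND CONFIDENCE INTERVALS FOR R²

Note that by (10.27), $\rho_{y|\mathbf{x}}^2 = 0$ becomes

$$\rho_{y|\mathbf{x}}^2 = \frac{\boldsymbol{\sigma}_{yx}'\boldsymbol{\Sigma}_{xx}^{-1}\boldsymbol{\sigma}_{yx}}{\sigma_{yy}} = 0,$$

which leads to $\boldsymbol{\sigma}_{yx} = \mathbf{0}$ since $\boldsymbol{\Sigma}_{xx}$ is positive definite. Then by (10.7), $\boldsymbol{\beta}_1 = \boldsymbol{\Sigma}_{xx}^{-1}\boldsymbol{\sigma}_{yx} = \mathbf{0}$, and $H_0\colon \rho_{y|\mathbf{x}}^2 = 0$ is equivalent to $H_0\colon \boldsymbol{\beta}_1 = \mathbf{0}$.

The F statistic for fixed x's is given in (8.5), (8.22), and (8.23) as

$$F = \frac{(\hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} - n\bar{y}^2)/k}{(\mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y})/(n - k - 1)} = \frac{R^2/k}{(1 - R^2)/(n - k - 1)}. \tag{10.44}$$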


The test statistic in (10.44) can be obtained by the likelihood ratio approach in the case of random x's (Anderson 1984, pp. 140-142):

Theorem 10.5. If $(y_1, \mathbf{x}_1'), (y_2, \mathbf{x}_2'), \ldots, (y_n, \mathbf{x}_n')$ is a random sample from $N_{k+1}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, where $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ are given by (10.2) and (10.3), the likelihood ratio test for $H_0\colon \boldsymbol{\beta}_1 = \mathbf{0}$ or equivalently $H_0\colon \rho_{y|\mathbf{x}}^2 = 0$ can be based on F in (10.44). We reject $H_0$ if $F \ge F_{\alpha,k,n-k-1}$.


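PROOF. Using the notation $\mathbf{v}_i = (y_i, \mathbf{x}_i')'$, as in the proof of Theorem 10.2a, the likelihood function $L(\boldsymbol{\mu}, \boldsymbol{\Sigma}) = \prod_{i=1}^n f(\mathbf{v}_i; \boldsymbol{\mu}, \boldsymbol{\Sigma})$ is given by (10.11), and the likelihood ratio is

$$LR = \frac{\max_{H_0} L(\boldsymbol{\mu}, \boldsymbol{\Sigma})}{\max_{H_1} L(\boldsymbol{\mu}, \boldsymbol{\Sigma})}.$$

Under $H_1$, the parameters $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ are essentially unrestricted, and we have

$$\max_{H_1} L(\boldsymbol{\mu}, \boldsymbol{\Sigma}) = \max L(\boldsymbol{\mu}, \boldsymbol{\Sigma}) = L(\hat{\boldsymbol{\mu}}, \hat{\boldsymbol{\Sigma}}),$$

where $\hat{\boldsymbol{\mu}}$ and $\hat{\boldsymbol{\Sigma}}$ are the maximum likelihood estimators in (10.9) and (10.10). Since $(\mathbf{v}_i - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{v}_i - \boldsymbol{\mu})$ is a scalar, the exponent of $L(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ in (10.11) can be written as

$$-\sum_{i=1}^n \frac{\mathrm{tr}[(\mathbf{v}_i - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{v}_i - \boldsymbol{\mu})]}{2} = -\sum_{i=1}^n \frac{\mathrm{tr}[\boldsymbol{\Sigma}^{-1}(\mathbf{v}_i - \boldsymbol{\mu})(\mathbf{v}_i - \boldsymbol{\mu})']}{2} = -\frac{\mathrm{tr}\left[\boldsymbol{\Sigma}^{-1}\sum_{i=1}^n (\mathbf{v}_i - \boldsymbol{\mu})(\mathbf{v}_i - \boldsymbol{\mu})'\right]}{2}.$$

Then substitution of $\hat{\boldsymbol{\mu}}$ and $\hat{\boldsymbol{\Sigma}}$ for $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ in $L(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ gives

$$\max_{H_1} L(\boldsymbol{\mu}, \boldsymbol{\Sigma}) = L(\hat{\boldsymbol{\mu}}, \hat{\boldsymbol{\Sigma}}) = \frac{1}{(\sqrt{2\pi})^{n(k+1)}|\hat{\boldsymbol{\Sigma}}|^{n/2}}\, e^{-\mathrm{tr}(\hat{\boldsymbol{\Sigma}}^{-1}n\hat{\boldsymbol{\Sigma}})/2} = \frac{e^{-n(k+1)/2}}{(\sqrt{2\pi})^{n(k+1)}|\hat{\boldsymbol{\Sigma}}|^{n/2}}.$$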


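Under $H_0\colon \rho_{y|\mathbf{x}}^2 = 0$, we have $\boldsymbol{\sigma}_{yx} = \mathbf{0}$, and $\boldsymbol{\Sigma}$ in (10.3) becomes

$$\boldsymbol{\Sigma}_0 = \begin{pmatrix} \sigma_{yy} & \mathbf{0}' \\ \mathbf{0} & \boldsymbol{\Sigma}_{xx} \end{pmatrix}, \tag{10.45}$$

whose maximum likelihood estimator is

$$\hat{\boldsymbol{\Sigma}}_0 = \begin{pmatrix} \hat{\sigma}_{yy} & \mathbf{0}' \\ \mathbf{0} & \hat{\boldsymbol{\Sigma}}_{xx} \end{pmatrix}. \tag{10.46}$$

Using $\hat{\boldsymbol{\Sigma}}_0$ in (10.46) and $\hat{\boldsymbol{\mu}} = \bar{\mathbf{v}}$ in (10.9), we have

$$\max_{H_0} L(\boldsymbol{\mu}, \boldsymbol{\Sigma}) = L(\hat{\boldsymbol{\mu}}, \hat{\boldsymbol{\Sigma}}_0) = \frac{1}{(\sqrt{2\pi})^{n(k+1)}|\hat{\boldsymbol{\Sigma}}_0|^{n/2}}\, e^{-\mathrm{tr}(\hat{\boldsymbol{\Sigma}}_0^{-1}n\hat{\boldsymbol{\Sigma}}_0)/2}.$$

By (2.74), this becomes

$$L(\hat{\boldsymbol{\mu}}, \hat{\boldsymbol{\Sigma}}_0) = \frac{e^{-n(k+1)/2}}{(\sqrt{2\pi})^{n(k+1)}\hat{\sigma}_{yy}^{n/2}|\hat{\boldsymbol{\Sigma}}_{xx}|^{n/2}}. \tag{10.47}$$

Thus

$$LR = \frac{|\hat{\boldsymbol{\Sigma}}|^{n/2}}{\hat{\sigma}_{yy}^{n/2}|\hat{\boldsymbol{\Sigma}}_{xx}|^{n/2}}. \tag{10.48}$$

Substituting $\hat{\boldsymbol{\Sigma}} = (n - 1)\mathbf{S}/n$ and using (10.40), we obtain

$$LR = (1 - R^2)^{n/2}. \tag{10.49}$$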
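The substitution step can be spelled out as follows. This is a sketch of the algebra only, using the fact that $\hat{\boldsymbol{\Sigma}}$ is $(k+1)\times(k+1)$ and $\hat{\boldsymbol{\Sigma}}_{xx}$ is $k\times k$, so the scalar factor $(n-1)/n$ cancels between numerator and denominator:

```latex
\frac{|\hat{\boldsymbol{\Sigma}}|}{\hat{\sigma}_{yy}\,|\hat{\boldsymbol{\Sigma}}_{xx}|}
  = \frac{\left(\frac{n-1}{n}\right)^{k+1} |\mathbf{S}|}
         {\left(\frac{n-1}{n}\right) s_{yy}\,\left(\frac{n-1}{n}\right)^{k} |\mathbf{S}_{xx}|}
  = \frac{|\mathbf{S}|}{s_{yy}\,|\mathbf{S}_{xx}|}
  = 1 - R^2 ,
\qquad\text{so}\qquad
LR = \left(\frac{|\hat{\boldsymbol{\Sigma}}|}{\hat{\sigma}_{yy}\,|\hat{\boldsymbol{\Sigma}}_{xx}|}\right)^{n/2}
   = (1 - R^2)^{n/2}.
```

The last equality in the first chain is the determinant form of $R^2$ in (10.40).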
We reject $H_0$ for $(1 - R^2)^{n/2} \le c$, which is equivalent to

$$\frac{R^2/k}{(1 - R^2)/(n - k - 1)} \ge F_{\alpha,k,n-k-1},$$

since $R^2/(1 - R^2)$ is a monotone increasing function of $R^2$ and F is distributed as $F(k, n - k - 1)$ when $H_0$ is true (Anderson 1984, pp. 138-139).


When $k = 1$, F in (10.44) reduces to $F = (n - 2)r^2/(1 - r^2)$. Then, by Problem 5.16,

$$t = \frac{\sqrt{n - 2}\,r}{\sqrt{1 - r^2}}$$

[see (6.20)] has a t distribution with $n - 2$ degrees of freedom (df) when $(y, x)$ has a bivariate normal distribution with $\rho = 0$.

If $(y, x)$ is bivariate normal and $\rho \ne 0$, then $\mathrm{var}(r) \cong (1 - \rho^2)^2/n$ and the function

$$u = \frac{r - \rho}{(1 - \rho^2)/\sqrt{n}} \tag{10.50}$$

is approximately standard normal for large n. However, the distribution of u approaches normality very slowly as n increases (Kendall and Stuart 1969, p. 236). Its use is questionable for $n < 500$.

Fisher (1921) found a function of r that approaches normality much faster than does (10.50) and can thereby be used with much smaller n than that required for (10.50). In addition, the variance is almost independent of $\rho$. Fisher's function is

$$z = \frac{1}{2}\ln\frac{1 + r}{1 - r} = \tanh^{-1} r, \tag{10.51}$$

where $\tanh^{-1} r$ is the inverse hyperbolic tangent of r. The approximate mean and variance of z are

$$E(z) \cong \frac{1}{2}\ln\frac{1 + \rho}{1 - \rho} = \tanh^{-1}\rho, \tag{10.52}$$

$$\mathrm{var}(z) \cong \frac{1}{n - 3}. \tag{10.53}$$


We can use Fisher's z transformation in (10.51) to test hypotheses such as $H_0\colon \rho = \rho_0$ or $H_0\colon \rho_1 = \rho_2$. To test $H_0\colon \rho = \rho_0$ vs. $H_1\colon \rho \ne \rho_0$, we calculate

$$v = \frac{z - \tanh^{-1}\rho_0}{\sqrt{1/(n - 3)}}, \tag{10.54}$$

which is approximately distributed as the standard normal $N(0, 1)$. We reject $H_0$ if $|v| \ge z_{\alpha/2}$, where $z = \tanh^{-1} r$ and $z_{\alpha/2}$ is the upper $\alpha/2$ percentage point of the standard normal distribution. To test $H_0\colon \rho_1 = \rho_2$ vs. $H_1\colon \rho_1 \ne \rho_2$ for two independent samples of sizes $n_1$ and $n_2$ yielding sample correlations $r_1$ and $r_2$, we calculate

$$v = \frac{z_1 - z_2}{\sqrt{1/(n_1 - 3) + 1/(n_2 - 3)}} \tag{10.55}$$

and reject $H_0$ if $|v| \ge z_{\alpha/2}$, where $z_1 = \tanh^{-1} r_1$ and $z_2 = \tanh^{-1} r_2$. To test $H_0\colon \rho_1 = \cdots = \rho_q$ for $q > 2$, see Problem 10.18.

To obtain a confidence interval for $\rho$, we note that since z in (10.51) is approximately normal, we can write

$$P\left(-z_{\alpha/2} \le \frac{z - \tanh^{-1}\rho}{1/\sqrt{n - 3}} \le z_{\alpha/2}\right) \cong 1 - \alpha. \tag{10.56}$$

Solving the inequality for $\rho$, we obtain the approximate $100(1 - \alpha)\%$ confidence interval

$$\tanh\left(z - \frac{z_{\alpha/2}}{\sqrt{n - 3}}\right) \le \rho \le \tanh\left(z + \frac{z_{\alpha/2}}{\sqrt{n - 3}}\right). \tag{10.57}$$

A confidence interval for $\rho_{y|\mathbf{x}}^2$ was given by Helland (1987).


Example 10.5a. For the hematology data in Table 10.1, we obtained $R^2$ in Example 10.4. The overall F test of $H_0\colon \boldsymbol{\beta}_1 = \mathbf{0}$ or $H_0\colon \rho_{y|\mathbf{x}}^2 = 0$ is carried out using F in (10.44):

$$F = \frac{R^2/k}{(1 - R^2)/(n - k - 1)} = \frac{.9232/5}{(1 - .9232)/45} = 108.158.$$

The p value is less than $10^{-16}$.
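The arithmetic in this example is easy to reproduce. The sketch below writes (10.44) as a small function (the name `f_from_r2` is hypothetical) and evaluates it at the unrounded value $R^2 = .92318$ reported later in Example 10.6, which is what yields 108.158; plugging in the rounded .9232 gives a slightly larger value.

```python
def f_from_r2(r2, n, k):
    """Overall F statistic of (10.44) written in terms of R^2."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))


f_stat = f_from_r2(0.92318, n=51, k=5)
print(round(f_stat, 2))

# With k = 1 the same function reduces to (n - 2) r^2 / (1 - r^2).
f_k1 = f_from_r2(0.25, n=10, k=1)
```

The p value itself requires the F distribution function, which is not in the Python standard library, so it is left out of this sketch.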




Example 10.5b. To illustrate Fisher's z transformation in (10.51) and its use to compare two independent correlations in (10.55), we divide the hematology data in Table 10.1 into two subsamples of sizes $n_1 = 26$ and $n_2 = 25$ (the first 26 observations and the last 25 observations). For the correlation between y and $x_1$ in each of the two subsamples, we obtain $r_1 = .4994$ and $r_2 = .0424$. The z transformation in (10.51) for each of these two values is given by

$$z_1 = \tanh^{-1} r_1 = .5485,$$
$$z_2 = \tanh^{-1} r_2 = .0425.$$

To test $H_0\colon \rho_1 = \rho_2$, we use the approximate test statistic (10.55) to obtain

$$v = \frac{.5485 - .0425}{\sqrt{1/(26 - 3) + 1/(25 - 3)}} = 1.6969.$$

Since $1.6969 < z_{.025} = 1.96$, we do not reject $H_0$.

To obtain approximate 95% confidence limits for $\rho_1$, we use (10.57):

$$\text{Lower limit for } \rho_1\colon \quad \tanh\left(.5485 - \frac{1.96}{\sqrt{23}}\right) = .1389,$$

$$\text{Upper limit for } \rho_1\colon \quad \tanh\left(.5485 + \frac{1.96}{\sqrt{23}}\right) = .7430.$$

For $\rho_2$, the limits are given by

$$\text{Lower limit for } \rho_2\colon \quad \tanh\left(.0425 - \frac{1.96}{\sqrt{22}}\right) = -.3587,$$

$$\text{Upper limit for } \rho_2\colon \quad \tanh\left(.0425 + \frac{1.96}{\sqrt{22}}\right) = .4303.$$
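The calculations in this example can be scripted with the Python standard library alone. The sketch below is illustrative; the function names are hypothetical, and because it carries full precision rather than the rounded z values used in the text, results may differ from the printed ones in the fourth decimal.

```python
import math
from statistics import NormalDist


def fisher_z(r):
    """Fisher's z transformation (10.51)."""
    return math.atanh(r)


def compare_correlations(r1, n1, r2, n2):
    """Approximate N(0, 1) statistic (10.55) for H0: rho1 = rho2."""
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return (fisher_z(r1) - fisher_z(r2)) / se


def rho_interval(r, n, alpha=0.05):
    """Approximate 100(1 - alpha)% confidence limits (10.57) for rho."""
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z_crit / math.sqrt(n - 3)
    z = fisher_z(r)
    return math.tanh(z - half_width), math.tanh(z + half_width)


v = compare_correlations(0.4994, 26, 0.0424, 25)
lo1, hi1 = rho_interval(0.4994, 26)
lo2, hi2 = rho_interval(0.0424, 25)
print(round(v, 3), round(lo1, 3), round(hi1, 3), round(lo2, 3), round(hi2, 3))
```

Note how wide both intervals are with roughly 25 observations each, which is consistent with the failure to reject equality of the two correlations.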


10.6 EFFECT OF EACH VARIABLE ON R²

The contribution of a variable $x_j$ to the multiple correlation $R$ will, in general, be different from its bivariate correlation with y; that is, the increase in $R^2$ when $x_j$ is added is not equal to $r_{yx_j}^2$. This increase in $R^2$ can be either more or less than $r_{yx_j}^2$.

It seems clear that relationships with other variables can render a variable partially redundant and thereby reduce the contribution of $x_j$ to $R^2$, but it is not intuitively apparent how the contribution of $x_j$ to $R^2$ can exceed $r_{yx_j}^2$. The latter phenomenon has been illustrated numerically by Flury (1989) and Hamilton (1987).


In this section, we provide a breakdown of the factors that determine how much each variable adds to $R^2$ and show how the increase in $R^2$ can exceed $r_{yx_j}^2$ (Rencher 1993). We first introduce some notation. The variable of interest is denoted by z, which can be one of the x's or a new variable added to the x's. We make the following additional notational definitions:

$R_{yw}^2$ = squared multiple correlation between y and $\mathbf{w} = (x_1, x_2, \ldots, x_k, z)'$.
$R_{yx}^2$ = squared multiple correlation between y and $\mathbf{x} = (x_1, x_2, \ldots, x_k)'$.
$R_{zx}^2 = \mathbf{s}_{zx}'\mathbf{S}_{xx}^{-1}\mathbf{s}_{zx}/s_z^2$ = squared multiple correlation between z and $\mathbf{x}$.
$r_{yz}$ = simple correlation between y and z.
$\mathbf{r}_{yx} = (r_{yx_1}, r_{yx_2}, \ldots, r_{yx_k})'$ = vector of correlations between y and $\mathbf{x}$.
$\mathbf{r}_{zx} = (r_{zx_1}, r_{zx_2}, \ldots, r_{zx_k})'$ = vector of correlations between z and $\mathbf{x}$.
$\hat{\boldsymbol{\beta}}_{zx}^* = \mathbf{R}_{xx}^{-1}\mathbf{r}_{zx}$ is the vector of standardized regression coefficients (beta weights) of z regressed on $\mathbf{x}$ [see (10.24)].

The effect of z on $R^2$ is formulated in the following theorem.

Theorem 10.6. The increase in $R^2$ due to z can be expressed as

$$R_{yw}^2 - R_{yx}^2 = \frac{(\hat{r}_{yz} - r_{yz})^2}{1 - R_{zx}^2}, \tag{10.58}$$

where $\hat{r}_{yz} = \hat{\boldsymbol{\beta}}_{zx}^{*\prime}\mathbf{r}_{yx}$ is a "predicted" value of $r_{yz}$ based on the relationship of z to the x's.


PROOF. See Problem 10.19.

Since the right side of (10.58) is nonnegative, $R^2$ cannot decrease with an additional variable, which is a verification of property 3 in Section 7.7. If z is orthogonal to $\mathbf{x}$ (i.e., if $\mathbf{r}_{zx} = \mathbf{0}$), then $\hat{\boldsymbol{\beta}}_{zx}^* = \mathbf{0}$, which implies that $\hat{r}_{yz} = 0$ and $R_{zx}^2 = 0$. In this case, (10.58) can be written as $R_{yw}^2 = R_{yx}^2 + r_{yz}^2$, which verifies property 5 of Section 7.7.

It is clear in Theorem 10.6 that the contribution of z to $R^2$ can either be less than or greater than $r_{yz}^2$. If $\hat{r}_{yz}$ is close to $r_{yz}$, the contribution of z is less than $r_{yz}^2$. There are three ways in which the contribution of z can exceed $r_{yz}^2$: (1) $\hat{r}_{yz}$ is substantially larger in absolute value than $r_{yz}$, (2) $\hat{r}_{yz}$ and $r_{yz}$ are of opposite signs, and (3) $R_{zx}^2$ is large.

In many cases, the researcher may find it helpful to know why a variable contributed more than expected or less than expected. For example, admission to a university or professional school may be based on previous grades and the score on a standardized national test. An applicant for admission to a university with limited enrollment would submit high school grades and a national test score. These might be entered into a regression equation to obtain a predicted value of first-year grade-point average at the university. It is typically found that the standardized test increases $R^2$ only slightly above that based on high school grades alone. This small increase in $R^2$ would be disappointing to admissions officials who had hoped that the national test score might be a more useful predictor than high school grades. The designers of such standardized tests may find it beneficial to know precisely why the test makes such an unexpectedly small contribution relative to high school grades.

In Theorem 10.6, we have available the specific information needed by the designer of the standardized test. To illustrate the use of (10.58), let y be the grade-point average for the first year at the university, let z be the score on the standardized test, and let $x_1, x_2, \ldots, x_k$ be high school grades in key subject areas. By (10.58), the increase in $R^2$ due to z is $(\hat{r}_{yz} - r_{yz})^2/(1 - R_{zx}^2)$, in which we see that z adds little to $R^2$ if $\hat{r}_{yz}$ is close to $r_{yz}$. We could examine the coefficients in $\hat{r}_{yz} = \hat{\boldsymbol{\beta}}_{zx}^{*\prime}\mathbf{r}_{yx}$ to determine which of the $r_{yx_j}$'s in $\mathbf{r}_{yx}$ have the most effect. This information could be used in redesigning the questions so as to reduce these particular $r_{yx_j}$'s. It may also be possible to increase the contribution of z to $R_{yw}^2$ by increasing $R_{zx}^2$ (thereby reducing $1 - R_{zx}^2$). This might be done by designing the questions in the standardized test so that the test score z is more correlated with high school grades, $x_1, x_2, \ldots, x_k$.

Theil and Chung (1988) proposed a measure of the relative importance of a vari- 
able in multiple regression based on information theory. 


Example 10.6. For the hematology data in Table 10.1, the overall $R_{yw}^2$ was found in Example 10.4 to be .92318. From Theorem 10.6, the increase in $R^2$ due to a variable z has the breakdown $R_{yw}^2 - R_{yx}^2 = (\hat{r}_{yz} - r_{yz})^2/(1 - R_{zx}^2)$, where z represents any one of $x_1, x_2, \ldots, x_5$, and $\mathbf{x}$ represents the other four variables. The values of $\hat{r}_{yz}$, $r_{yz}$, $R_{zx}^2$, $R_{yw}^2 - R_{yx}^2$, and F are given below for each variable in turn as z:

z     $\hat{r}_{yz}$   $r_{yz}$   $R_{zx}^2$   $R_{yw}^2 - R_{yx}^2$   F       p value
x1    .2101            .1943      .6332        .00068                  0.4     .53
x2    .2486            .2210      .6426        .00213                  1.25    .26
x3    .0932            .7890      .4423        .86820                  508.6   0
x4    .4822            .0526      .3837        .29945                  175.4   0
x5    .0659            .0758      .0979        .00011                  0.064   .81

The F value is from the partial F test in (8.25), (8.37), or (8.39) for the significance of the increase in $R^2$ due to each variable.

An interesting variable here is $x_4$, whose value of $r_{yz}$ is .0526, the smallest among the five variables. Despite this small individual correlation with y, $x_4$ contributes much more to $R_{yw}^2$ than do all other variables except $x_3$ because $\hat{r}_{yz}$ is much greater for $x_4$ than for the other variables. This illustrates how the contribution of a variable can be augmented in the presence of other variables as reflected in $\hat{r}_{yz}$.

The difference between the two major contributors $x_3$ and $x_4$ may be very revealing to the researcher. The contribution of $x_3$ to $R_{yw}^2$ is due mostly to its own correlation with y, whereas virtually all the effect of $x_4$ comes from its association with the other variables as reflected in $\hat{r}_{yz}$.
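The entries in the table of Example 10.6 can be checked directly from (10.58). The following sketch is illustrative only: the function names are hypothetical, the partial F is written in the familiar form (increase in $R^2$) divided by $(1 - R^2_{yw})/(n - k - 1)$ for a single added variable, and the inputs are the rounded table values, so agreement with the printed increases and F values is approximate.

```python
def r2_increase(r_hat_yz, r_yz, r2_zx):
    """Increase in R^2 from adding z, as given by (10.58)."""
    return (r_hat_yz - r_yz) ** 2 / (1.0 - r2_zx)


def partial_f(increase, r2_full, n, k):
    """Partial F for one added variable; k counts all predictors in the full model."""
    return increase / ((1.0 - r2_full) / (n - k - 1))


# Rows of the table in Example 10.6: (r_hat_yz, r_yz, R^2_zx).
rows = {
    "x1": (0.2101, 0.1943, 0.6332),
    "x2": (0.2486, 0.2210, 0.6426),
    "x3": (0.0932, 0.7890, 0.4423),
    "x4": (0.4822, 0.0526, 0.3837),
    "x5": (0.0659, 0.0758, 0.0979),
}
R2_FULL, N, K = 0.92318, 51, 5

results = {}
for name, (r_hat, r, r2_zx) in rows.items():
    inc = r2_increase(r_hat, r, r2_zx)
    results[name] = (inc, partial_f(inc, R2_FULL, N, K))
    print(name, round(inc, 5), round(results[name][1], 1))
```

For $x_4$ the numerator $(\hat{r}_{yz} - r_{yz})^2$ is large even though $r_{yz}$ itself is tiny, which is exactly the augmentation effect described above.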


10.7 PREDICTION FOR MULTIVARIATE NORMAL OR 
NONNORMAL DATA 


In this section, we consider an approach to modeling and estimation in the random-x case that is somewhat reminiscent of least squares in the fixed-x case. Suppose that $(y, \mathbf{x}') = (y, x_1, x_2, \ldots, x_k)$ is not necessarily assumed to be multivariate normal and we wish to find a function $t(\mathbf{x})$ for predicting y. In order to find a predicted value $t(\mathbf{x})$ that is expected to be "close" to y, we will choose the function $t(\mathbf{x})$ that minimizes the mean squared error $E[y - t(\mathbf{x})]^2$, where the expectation is in the joint distribution of $y, x_1, \ldots, x_k$. This function is given in the following theorem.


Theorem 10.7a. For the random vector $(y, \mathbf{x}')$, the function $t(\mathbf{x})$ that minimizes the mean squared error $E[y - t(\mathbf{x})]^2$ is given by $E(y|\mathbf{x})$.

PROOF. For notational simplicity, we use $k = 1$. By (4.28), the joint density $g(y, x)$ can be written as $g(y, x) = f(y|x)h(x)$. Then

$$E[y - t(x)]^2 = \int\!\!\int [y - t(x)]^2 g(y, x)\,dy\,dx = \int\!\!\int [y - t(x)]^2 f(y|x)h(x)\,dy\,dx = \int h(x)\left\{\int [y - t(x)]^2 f(y|x)\,dy\right\}dx.$$

To find the function $t(x)$ that minimizes $E[y - t(x)]^2$, we differentiate with respect to $t(x)$ and set the result equal to 0 [for a more general proof not involving differentiation, see Graybill (1976, pp. 432-434) or Christensen (1996, p. 119)]. Assuming that we can interchange integration and differentiation, we obtain

$$\frac{\partial E[y - t(x)]^2}{\partial t(x)} = \int h(x)\left\{\int 2(-1)[y - t(x)]f(y|x)\,dy\right\}dx = 0,$$

which gives

$$2\int h(x)\left[\int y f(y|x)\,dy - \int t(x)f(y|x)\,dy\right]dx = 0,$$

$$2\int h(x)[E(y|x) - t(x)]\,dx = 0.$$

The left side is 0 if

$$t(x) = E(y|x).$$


In the case of the multivariate normal, the prediction function $E(y|\mathbf{x})$ is a linear function of $\mathbf{x}$ [see (10.4) and (10.5)]. However, in general, $E(y|\mathbf{x})$ is not linear. For an illustration of a nonlinear $E(y|x)$, see Example 3.2, in which we have $E(y|x) = (1 + 4x - 2x^2)$.

If we restrict $t(\mathbf{x})$ to linear functions of $\mathbf{x}$, then the optimal result is the same linear function as in the multivariate normal case [see (10.6) and (10.7)].


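Theorem 10.7b. The linear function $t(\mathbf{x})$ that minimizes $E[y - t(\mathbf{x})]^2$ is given by $t(\mathbf{x}) = \beta_0 + \boldsymbol{\beta}_1'\mathbf{x}$, where

$$\beta_0 = \mu_y - \boldsymbol{\sigma}_{yx}'\boldsymbol{\Sigma}_{xx}^{-1}\boldsymbol{\mu}_x, \tag{10.59}$$

$$\boldsymbol{\beta}_1 = \boldsymbol{\Sigma}_{xx}^{-1}\boldsymbol{\sigma}_{yx}. \tag{10.60}$$

PROOF. See Problem 10.21.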
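A small discrete example may help fix the distinction between the unrestricted minimizer $E(y|x)$ and the best linear predictor of (10.59) and (10.60). The joint distribution below is hypothetical (it is not from the text): x is uniform on {-1, 0, 1} and, given x, y equals $x^2 - 1$ or $x^2 + 1$ with equal probability, so $E(y|x) = x^2$ is nonlinear. Exact rational arithmetic is used so the two mean squared errors can be compared without rounding.

```python
from fractions import Fraction as F

# Hypothetical joint pmf: x uniform on {-1, 0, 1}; given x, y = x^2 - 1 or x^2 + 1.
pmf = {(x, x * x + e): F(1, 6) for x in (-1, 0, 1) for e in (-1, 1)}


def expect(g):
    """Expectation of g(x, y) under the joint pmf."""
    return sum(p * g(x, y) for (x, y), p in pmf.items())


def cond_mean(x0):
    """E(y | x = x0), computed from the joint pmf."""
    px = sum(p for (x, _), p in pmf.items() if x == x0)
    return sum(p * y for (x, y), p in pmf.items() if x == x0) / px


def mse(t):
    """Mean squared error E[y - t(x)]^2 of a predictor t."""
    return expect(lambda x, y: (y - t(x)) ** 2)


mu_x, mu_y = expect(lambda x, y: x), expect(lambda x, y: y)
var_x = expect(lambda x, y: (x - mu_x) ** 2)
cov_xy = expect(lambda x, y: (x - mu_x) * (y - mu_y))
b1 = cov_xy / var_x        # scalar case of (10.60)
b0 = mu_y - b1 * mu_x      # scalar case of (10.59)

mse_cond = mse(cond_mean)
mse_lin = mse(lambda x: b0 + b1 * x)
print(mse_cond, mse_lin)
```

Here the covariance between x and y is zero, so the best linear predictor is the constant $\mu_y$, and its mean squared error exceeds that of the conditional mean.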


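We can find estimators $\hat{\beta}_0$ and $\hat{\boldsymbol{\beta}}_1$ for $\beta_0$ and $\boldsymbol{\beta}_1$ in (10.59) and (10.60) by minimizing the sample mean squared error, $\sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\boldsymbol{\beta}}_1'\mathbf{x}_i)^2/n$. The results are given in the following theorem.

Theorem 10.7c. If $(y_1, \mathbf{x}_1'), (y_2, \mathbf{x}_2'), \ldots, (y_n, \mathbf{x}_n')$ is a random sample with mean vector and covariance matrix

$$\hat{\boldsymbol{\mu}} = \begin{pmatrix} \bar{y} \\ \bar{\mathbf{x}} \end{pmatrix}, \qquad \mathbf{S} = \begin{pmatrix} s_{yy} & \mathbf{s}_{yx}' \\ \mathbf{s}_{yx} & \mathbf{S}_{xx} \end{pmatrix},$$

then the estimators $\hat{\beta}_0$ and $\hat{\boldsymbol{\beta}}_1$ that minimize $\sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\boldsymbol{\beta}}_1'\mathbf{x}_i)^2/n$ are given by

$$\hat{\beta}_0 = \bar{y} - \mathbf{s}_{yx}'\mathbf{S}_{xx}^{-1}\bar{\mathbf{x}}, \tag{10.61}$$

$$\hat{\boldsymbol{\beta}}_1 = \mathbf{S}_{xx}^{-1}\mathbf{s}_{yx}. \tag{10.62}$$

PROOF. See Problem 10.22.

The estimators $\hat{\beta}_0$ and $\hat{\boldsymbol{\beta}}_1$ in (10.61) and (10.62) are the same as the maximum likelihood estimators in the normal case [see (10.15) and (10.16)].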


10.8 SAMPLE PARTIAL CORRELATIONS 


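Partial correlations were introduced in Sections 4.5 and 7.10. Assuming multivariate normality, the population partial correlation $\rho_{ij\cdot rs\ldots q}$ is the correlation between $y_i$ and $y_j$ in the conditional distribution of $\mathbf{y}$ given $\mathbf{x}$, where $y_i$ and $y_j$ are in $\mathbf{y}$ and the subscripts $r, s, \ldots, q$ represent all the variables in $\mathbf{x}$. By (4.36), we obtain

$$\rho_{ij\cdot rs\ldots q} = \frac{\sigma_{ij\cdot rs\ldots q}}{\sqrt{\sigma_{ii\cdot rs\ldots q}\,\sigma_{jj\cdot rs\ldots q}}}, \tag{10.63}$$

where $\sigma_{ij\cdot rs\ldots q}$ is the $(ij)$th element of $\boldsymbol{\Sigma}_{y\cdot x} = \mathrm{cov}(\mathbf{y}|\mathbf{x})$. For normal populations, $\boldsymbol{\Sigma}_{y\cdot x}$ is given by (4.27) as $\boldsymbol{\Sigma}_{y\cdot x} = \boldsymbol{\Sigma}_{yy} - \boldsymbol{\Sigma}_{yx}\boldsymbol{\Sigma}_{xx}^{-1}\boldsymbol{\Sigma}_{xy}$, where $\boldsymbol{\Sigma}_{yy}$, $\boldsymbol{\Sigma}_{yx}$, $\boldsymbol{\Sigma}_{xx}$, and $\boldsymbol{\Sigma}_{xy}$ are from the partitioned covariance matrix

$$\mathrm{cov}\begin{pmatrix} \mathbf{y} \\ \mathbf{x} \end{pmatrix} = \boldsymbol{\Sigma} = \begin{pmatrix} \boldsymbol{\Sigma}_{yy} & \boldsymbol{\Sigma}_{yx} \\ \boldsymbol{\Sigma}_{xy} & \boldsymbol{\Sigma}_{xx} \end{pmatrix}$$

[see (3.33)]. The matrix of (population) partial correlations $\rho_{ij\cdot rs\ldots q}$ can be found by (4.37):

$$\mathbf{P}_{y\cdot x} = \mathbf{D}_{y\cdot x}^{-1}\boldsymbol{\Sigma}_{y\cdot x}\mathbf{D}_{y\cdot x}^{-1} = \mathbf{D}_{y\cdot x}^{-1}(\boldsymbol{\Sigma}_{yy} - \boldsymbol{\Sigma}_{yx}\boldsymbol{\Sigma}_{xx}^{-1}\boldsymbol{\Sigma}_{xy})\mathbf{D}_{y\cdot x}^{-1}, \tag{10.64}$$

where $\mathbf{D}_{y\cdot x} = [\mathrm{diag}(\boldsymbol{\Sigma}_{y\cdot x})]^{1/2}$.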

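To obtain a maximum likelihood estimator $\mathbf{R}_{y\cdot x} = (r_{ij\cdot rs\ldots q})$ of $\mathbf{P}_{y\cdot x} = (\rho_{ij\cdot rs\ldots q})$ in (10.64), we use the invariance property of maximum likelihood estimators (Theorem 10.2b) to obtain

$$\mathbf{R}_{y\cdot x} = \mathbf{D}_s^{-1}(\mathbf{S}_{yy} - \mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy})\mathbf{D}_s^{-1}, \tag{10.65}$$

where

$$\mathbf{D}_s = [\mathrm{diag}(\mathbf{S}_{yy} - \mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy})]^{1/2}.$$

The matrices $\mathbf{S}_{yy}$, $\mathbf{S}_{yx}$, $\mathbf{S}_{xx}$, and $\mathbf{S}_{xy}$ are from the sample covariance matrix partitioned by analogy to $\boldsymbol{\Sigma}$ above,

$$\mathbf{S} = \begin{pmatrix} \mathbf{S}_{yy} & \mathbf{S}_{yx} \\ \mathbf{S}_{xy} & \mathbf{S}_{xx} \end{pmatrix},$$

where

$$\mathbf{S}_{yy} = \begin{pmatrix} s_{y_1}^2 & s_{y_1y_2} & \cdots & s_{y_1y_p} \\ s_{y_2y_1} & s_{y_2}^2 & \cdots & s_{y_2y_p} \\ \vdots & \vdots & & \vdots \\ s_{y_py_1} & s_{y_py_2} & \cdots & s_{y_p}^2 \end{pmatrix} \quad \text{and} \quad \mathbf{S}_{yx} = \begin{pmatrix} s_{y_1x_1} & s_{y_1x_2} & \cdots & s_{y_1x_q} \\ s_{y_2x_1} & s_{y_2x_2} & \cdots & s_{y_2x_q} \\ \vdots & \vdots & & \vdots \\ s_{y_px_1} & s_{y_px_2} & \cdots & s_{y_px_q} \end{pmatrix}$$

are estimators of $\boldsymbol{\Sigma}_{yy}$ and $\boldsymbol{\Sigma}_{yx}$. Thus the maximum likelihood estimator of $\rho_{ij\cdot rs\ldots q}$ in (10.63) is $r_{ij\cdot rs\ldots q}$, the $(ij)$th element of $\mathbf{R}_{y\cdot x}$ in (10.65).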

We now consider two other expressions for partial correlation and show that they are equivalent to $r_{ij\cdot rs\ldots q}$ in (10.65). To simplify exposition, we illustrate with $r_{12\cdot 3}$. The sample partial correlation of $y_1$ and $y_2$ with $y_3$ held fixed is usually given as

$$r_{12\cdot 3} = \frac{r_{12} - r_{13}r_{23}}{\sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}}, \tag{10.66}$$

where $r_{12}$, $r_{13}$, and $r_{23}$ are the ordinary correlations between $y_1$ and $y_2$, $y_1$ and $y_3$, and $y_2$ and $y_3$, respectively. In the following theorem, we relate $r_{12\cdot 3}$ to two previous definitions of partial correlation.


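Theorem 10.8a. The expression for $r_{12\cdot 3}$ in (10.66) is equivalent to an element of $\mathbf{R}_{y\cdot x}$ in (10.65) and is also equal to $r_{y_1-\hat{y}_1,\,y_2-\hat{y}_2}$ from (7.94), where $y_1 - \hat{y}_1$ and $y_2 - \hat{y}_2$ are residuals from regression of $y_1$ on $y_3$ and $y_2$ on $y_3$.

PROOF. We first consider $r_{y_1-\hat{y}_1,\,y_2-\hat{y}_2}$, which is not a maximum likelihood estimator and can therefore be used when the data are not normal. We obtain $\hat{y}_1$ and $\hat{y}_2$ by regressing $y_1$ on $y_3$ and $y_2$ on $y_3$. Using the notation in Section 7.10, we indicate the predicted value of $y_1$ based on regression of $y_1$ on $y_3$ as $\hat{y}_1(y_3)$. With a similar definition of $\hat{y}_2(y_3)$, the residuals can be expressed as

$$u_1 = y_1 - \hat{y}_1(y_3) = y_1 - (\hat{\beta}_{01} + \hat{\beta}_{11}y_3),$$
$$u_2 = y_2 - \hat{y}_2(y_3) = y_2 - (\hat{\beta}_{02} + \hat{\beta}_{12}y_3),$$

where, by (6.5), $\hat{\beta}_{11}$ and $\hat{\beta}_{12}$ are the usual least-squares estimators

$$\hat{\beta}_{11} = \frac{\sum_{i=1}^n (y_{1i} - \bar{y}_1)(y_{3i} - \bar{y}_3)}{\sum_{i=1}^n (y_{3i} - \bar{y}_3)^2}, \tag{10.67}$$

$$\hat{\beta}_{12} = \frac{\sum_{i=1}^n (y_{2i} - \bar{y}_2)(y_{3i} - \bar{y}_3)}{\sum_{i=1}^n (y_{3i} - \bar{y}_3)^2}. \tag{10.68}$$

Then the sample correlation between $u_1 = y_1 - \hat{y}_1(y_3)$ and $u_2 = y_2 - \hat{y}_2(y_3)$ [see (7.94)] is

$$r_{u_1u_2} = r_{y_1-\hat{y}_1,\,y_2-\hat{y}_2} = \frac{\widehat{\mathrm{cov}}(u_1, u_2)}{\sqrt{\widehat{\mathrm{var}}(u_1)\,\widehat{\mathrm{var}}(u_2)}}. \tag{10.69}$$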


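Since the sample mean of the residuals $u_1$ and $u_2$ is 0 [see (9.11)], $r_{u_1u_2}$ can be written as

$$r_{u_1u_2} = \frac{\sum_{i=1}^n u_{1i}u_{2i}}{\sqrt{\sum_{i=1}^n u_{1i}^2\sum_{i=1}^n u_{2i}^2}} = \frac{\sum_{i=1}^n (y_{1i} - \hat{y}_{1i})(y_{2i} - \hat{y}_{2i})}{\sqrt{\sum_{i=1}^n (y_{1i} - \hat{y}_{1i})^2\sum_{i=1}^n (y_{2i} - \hat{y}_{2i})^2}}. \tag{10.70}$$

We now show that $r_{u_1u_2}$ in (10.70) can be expressed as an element of $\mathbf{R}_{y\cdot x}$ in (10.65). Note that in this illustration, $\mathbf{R}_{y\cdot x}$ is $2 \times 2$. The numerator of (10.70) can be written as

$$\sum_{i=1}^n u_{1i}u_{2i} = \sum_{i=1}^n (y_{1i} - \hat{y}_{1i})(y_{2i} - \hat{y}_{2i}) = \sum_{i=1}^n (y_{1i} - \hat{\beta}_{01} - \hat{\beta}_{11}y_{3i})(y_{2i} - \hat{\beta}_{02} - \hat{\beta}_{12}y_{3i}).$$

Using $\hat{\beta}_{01} = \bar{y}_1 - \hat{\beta}_{11}\bar{y}_3$ and $\hat{\beta}_{02} = \bar{y}_2 - \hat{\beta}_{12}\bar{y}_3$, we obtain

$$\sum_{i=1}^n u_{1i}u_{2i} = \sum_{i=1}^n [y_{1i} - \bar{y}_1 - \hat{\beta}_{11}(y_{3i} - \bar{y}_3)][y_{2i} - \bar{y}_2 - \hat{\beta}_{12}(y_{3i} - \bar{y}_3)]$$
$$= \sum_{i=1}^n (y_{1i} - \bar{y}_1)(y_{2i} - \bar{y}_2) - \hat{\beta}_{11}\hat{\beta}_{12}\sum_{i=1}^n (y_{3i} - \bar{y}_3)^2. \tag{10.71}$$

The other two terms in (10.71) sum to zero. Using (10.67) and (10.68), the second term on the right side of (10.71) can be written as

$$\hat{\beta}_{11}\hat{\beta}_{12}\sum_{i=1}^n (y_{3i} - \bar{y}_3)^2 = \frac{\left[\sum_{i=1}^n (y_{1i} - \bar{y}_1)(y_{3i} - \bar{y}_3)\right]\left[\sum_{i=1}^n (y_{2i} - \bar{y}_2)(y_{3i} - \bar{y}_3)\right]}{\sum_{i=1}^n (y_{3i} - \bar{y}_3)^2}. \tag{10.72}$$

If we divide (10.71) by $n - 1$, divide numerator and denominator of (10.72) by $n - 1$, and substitute (10.72) into (10.71), we obtain

$$\widehat{\mathrm{cov}}(u_1, u_2) = \widehat{\mathrm{cov}}(y_1 - \hat{y}_1,\, y_2 - \hat{y}_2) = s_{12} - \frac{s_{13}s_{23}}{s_{33}}.$$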


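This is the element in the first row and second column of $\mathbf{S}_{yy} - \mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}$ in (10.65), where

$$\mathbf{S}_{yy} = \begin{pmatrix} s_{11} & s_{12} \\ s_{21} & s_{22} \end{pmatrix}, \quad \mathbf{S}_{yx} = \mathbf{s}_{yx} = \begin{pmatrix} s_{13} \\ s_{23} \end{pmatrix}, \quad \mathbf{S}_{xx} = s_{33}, \quad \text{and} \quad \mathbf{S}_{xy} = \mathbf{s}_{yx}'.$$

In this case, the $2 \times 2$ matrix $\mathbf{S}_{yy} - \mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}$ is given by

$$\mathbf{S}_{yy} - \mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy} = \begin{pmatrix} s_{11} & s_{12} \\ s_{21} & s_{22} \end{pmatrix} - \frac{1}{s_{33}}\begin{pmatrix} s_{13} \\ s_{23} \end{pmatrix}(s_{13},\ s_{23}) = \begin{pmatrix} s_{11} & s_{12} \\ s_{21} & s_{22} \end{pmatrix} - \frac{1}{s_{33}}\begin{pmatrix} s_{13}^2 & s_{13}s_{23} \\ s_{23}s_{13} & s_{23}^2 \end{pmatrix}.$$

Thus $r_{u_1u_2}$, as based on residuals in (10.69), is equivalent to the maximum likelihood estimator in (10.65).

We now use (10.71) to convert $r_{u_1u_2}$ in (10.69) into the familiar formula for $r_{12\cdot 3}$ given in (10.66). By (10.70), we obtain

$$r_{u_1u_2} = \frac{\sum_{i=1}^n u_{1i}u_{2i}}{\sqrt{\sum_{i=1}^n u_{1i}^2\sum_{i=1}^n u_{2i}^2}}. \tag{10.73}$$

By an extension of (10.71), we further obtain

$$\sum_{i=1}^n u_{1i}^2 = \sum_i (y_{1i} - \bar{y}_1)^2 - \hat{\beta}_{11}^2\sum_i (y_{3i} - \bar{y}_3)^2, \tag{10.74}$$

$$\sum_{i=1}^n u_{2i}^2 = \sum_i (y_{2i} - \bar{y}_2)^2 - \hat{\beta}_{12}^2\sum_i (y_{3i} - \bar{y}_3)^2. \tag{10.75}$$

Then (10.73) becomes

$$r_{u_1u_2} = \frac{\sum_i (y_{1i} - \bar{y}_1)(y_{2i} - \bar{y}_2) - \hat{\beta}_{11}\hat{\beta}_{12}\sum_i (y_{3i} - \bar{y}_3)^2}{\sqrt{\left[\sum_i (y_{1i} - \bar{y}_1)^2 - \hat{\beta}_{11}^2\sum_i (y_{3i} - \bar{y}_3)^2\right]\left[\sum_i (y_{2i} - \bar{y}_2)^2 - \hat{\beta}_{12}^2\sum_i (y_{3i} - \bar{y}_3)^2\right]}}. \tag{10.76}$$

We now substitute for $\hat{\beta}_{11}$ and $\hat{\beta}_{12}$ as defined in (10.67) and (10.68) and divide numerator and denominator by $\sqrt{\sum_i (y_{1i} - \bar{y}_1)^2\sum_i (y_{2i} - \bar{y}_2)^2}$ to obtain

$$r_{u_1u_2} = r_{12\cdot 3} = \frac{r_{12} - r_{13}r_{23}}{\sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}}. \tag{10.77}$$

Thus $r_{u_1u_2}$ based on residuals as in (10.69) is equivalent to the usual formulation $r_{12\cdot 3}$ in (10.66).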
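The equivalence established in Theorem 10.8a is easy to confirm numerically. The sketch below uses a small made-up data set (the values and helper names are hypothetical, not from the text), computes $r_{12\cdot 3}$ from the correlation formula (10.66), and compares it with the ordinary correlation of the two residual vectors as in (10.69).

```python
import math

# Hypothetical toy data: three variables observed on six units.
y1 = [2.0, 4.0, 5.0, 4.0, 7.0, 9.0]
y2 = [1.0, 3.0, 2.0, 5.0, 4.0, 7.0]
y3 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]


def mean(v):
    return sum(v) / len(v)


def sxy(u, v):
    """Sum of cross products about the means."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))


def corr(u, v):
    return sxy(u, v) / math.sqrt(sxy(u, u) * sxy(v, v))


def residuals(u, v):
    """Residuals from the least-squares regression of u on v, with slope as in (10.67)."""
    slope = sxy(u, v) / sxy(v, v)
    intercept = mean(u) - slope * mean(v)
    return [a - (intercept + slope * b) for a, b in zip(u, v)]


r12, r13, r23 = corr(y1, y2), corr(y1, y3), corr(y2, y3)
formula = (r12 - r13 * r23) / math.sqrt((1 - r13**2) * (1 - r23**2))   # (10.66)
from_residuals = corr(residuals(y1, y3), residuals(y2, y3))            # (10.69)
print(round(formula, 6), round(from_residuals, 6))
```

The two printed values coincide up to floating-point error, whatever data are supplied, provided neither $y_1$ nor $y_2$ is an exact linear function of $y_3$.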


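For the general case $r_{ij\cdot rs\ldots q}$, where i and j are subscripts pertaining to $\mathbf{y}$ and $r, s, \ldots, q$ are all the subscripts associated with $\mathbf{x}$, we define a residual vector $\mathbf{y}_i - \hat{\mathbf{y}}_i(\mathbf{x})$, where $\hat{\mathbf{y}}_i(\mathbf{x})$ is the vector of predicted values from the regression of $\mathbf{y}$ on $\mathbf{x}$. [Note that i is used differently in $r_{ij\cdot rs\ldots q}$ and $\mathbf{y}_i - \hat{\mathbf{y}}_i(\mathbf{x})$.] In Theorem 10.8a, $r_{12\cdot 3}$ was found to be equal to $r_{y_1-\hat{y}_1,\,y_2-\hat{y}_2}$, the ordinary correlation of the two residuals, and to be equivalent to the partial correlation defined as an element of $\mathbf{R}_{y\cdot x}$ in (10.65). In the following theorem, this is extended to the vectors $\mathbf{y}$ and $\mathbf{x}$.

Theorem 10.8b. The sample covariance matrix of the residual vector $\mathbf{y}_i - \hat{\mathbf{y}}_i(\mathbf{x})$ is equivalent to $\mathbf{S}_{yy} - \mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}$ in (10.65), that is, $\mathbf{S}_{y-\hat{y}} = \mathbf{S}_{yy} - \mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}$.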


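PROOF. The sample predicted value $\hat{\mathbf{y}}_i(\mathbf{x})$ is an estimator of $E(\mathbf{y}|\mathbf{x}_i) = \boldsymbol{\mu}_y + \boldsymbol{\Sigma}_{yx}\boldsymbol{\Sigma}_{xx}^{-1}(\mathbf{x}_i - \boldsymbol{\mu}_x)$ given in (4.26). For $\hat{\mathbf{y}}_i(\mathbf{x})$, we use the maximum likelihood estimator of $E(\mathbf{y}|\mathbf{x}_i)$:

$$\hat{\mathbf{y}}_i(\mathbf{x}) = \bar{\mathbf{y}} + \mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}(\mathbf{x}_i - \bar{\mathbf{x}}). \tag{10.78}$$

[The same result can be obtained without reference to normality; see Rencher (1998, p. 304).]

Since the sample mean of $\mathbf{y}_i - \hat{\mathbf{y}}_i(\mathbf{x})$ is $\mathbf{0}$ (see Problem 10.26), the sample covariance matrix of $\mathbf{y}_i - \hat{\mathbf{y}}_i(\mathbf{x})$ is defined as

$$\mathbf{S}_{y-\hat{y}} = \frac{1}{n - 1}\sum_{i=1}^n [\mathbf{y}_i - \hat{\mathbf{y}}_i(\mathbf{x})][\mathbf{y}_i - \hat{\mathbf{y}}_i(\mathbf{x})]' \tag{10.79}$$

(see Problem 10.1). We first note that by extension of (10.13), we have $\mathbf{S}_{yy} = \sum_{i=1}^n (\mathbf{y}_i - \bar{\mathbf{y}})(\mathbf{y}_i - \bar{\mathbf{y}})'/(n - 1)$, $\mathbf{S}_{yx} = \sum_{i=1}^n (\mathbf{y}_i - \bar{\mathbf{y}})(\mathbf{x}_i - \bar{\mathbf{x}})'/(n - 1)$, and $\mathbf{S}_{xx} = \sum_{i=1}^n (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})'/(n - 1)$ (see Problem 10.27). Using these expressions, after substituting (10.78) in (10.79), we obtain

$$\mathbf{S}_{y-\hat{y}} = \frac{1}{n - 1}\sum_{i=1}^n [\mathbf{y}_i - \bar{\mathbf{y}} - \mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}(\mathbf{x}_i - \bar{\mathbf{x}})][\mathbf{y}_i - \bar{\mathbf{y}} - \mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}(\mathbf{x}_i - \bar{\mathbf{x}})]'$$

$$= \frac{1}{n - 1}\left[\sum_{i=1}^n (\mathbf{y}_i - \bar{\mathbf{y}})(\mathbf{y}_i - \bar{\mathbf{y}})' - \sum_{i=1}^n (\mathbf{y}_i - \bar{\mathbf{y}})(\mathbf{x}_i - \bar{\mathbf{x}})'\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy} - \mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\sum_{i=1}^n (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{y}_i - \bar{\mathbf{y}})' + \mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\sum_{i=1}^n (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})'\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}\right]$$

$$= \mathbf{S}_{yy} - \mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy} - \mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy} + \mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}$$

$$= \mathbf{S}_{yy} - \mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}.$$

Thus the covariance matrix of residuals gives the same result as the maximum likelihood estimator of conditional covariances and correlations in (10.65).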


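Example 10.8. We illustrate some partial correlations for the hematology data in Table 10.1. To find $r_{y1\cdot 2345}$, for example, we use (10.65), $\mathbf{R}_{y\cdot x} = \mathbf{D}_s^{-1}(\mathbf{S}_{yy} - \mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy})\mathbf{D}_s^{-1}$. In this case, $\mathbf{y} = (y, x_1)'$ and $\mathbf{x} = (x_2, x_3, x_4, x_5)'$. The matrix $\mathbf{S}$ is therefore partitioned as

$$\mathbf{S} = \left(\begin{array}{rr|rrrr} 90.290 & 1.535 & 4.880 & 106.202 & 3.753 & 3.064 \\ 1.535 & 0.691 & 1.494 & 3.255 & 0.422 & -0.268 \\ \hline 4.880 & 1.494 & 5.401 & 10.155 & 1.374 & 1.292 \\ 106.202 & 3.255 & 10.155 & 200.668 & 64.655 & 4.067 \\ 3.753 & 0.422 & 1.374 & 64.655 & 56.374 & 0.579 \\ 3.064 & -0.268 & 1.292 & 4.067 & 0.579 & 18.078 \end{array}\right) = \begin{pmatrix} \mathbf{S}_{yy} & \mathbf{S}_{yx} \\ \mathbf{S}_{xy} & \mathbf{S}_{xx} \end{pmatrix}.$$

The matrix $\mathbf{D}_s = [\mathrm{diag}(\mathbf{S}_{yy} - \mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy})]^{1/2}$ is given by

$$\mathbf{D}_s = \begin{pmatrix} 2.645 & 0 \\ 0 & 0.503 \end{pmatrix},$$

and we have

$$\mathbf{R}_{y\cdot x} = \begin{pmatrix} 1.0000 & -0.0934 \\ -0.0934 & 1.0000 \end{pmatrix}.$$

Thus, $r_{y1\cdot 2345} = -.0934$. On the other hand, $r_{y1} = .1943$.

To find $r_{y2\cdot 1345}$, we have $\mathbf{y} = (y, x_2)'$ and $\mathbf{x} = (x_1, x_3, x_4, x_5)'$. Thus

$$\mathbf{S}_{yy} = \begin{pmatrix} 90.290 & 4.880 \\ 4.880 & 5.401 \end{pmatrix},$$

and there are corresponding matrices for $\mathbf{S}_{yx}$, $\mathbf{S}_{xy}$, and $\mathbf{S}_{xx}$. The diagonal matrix $\mathbf{D}_s$ is given by $\mathbf{D}_s = \mathrm{diag}(2.670, 1.389)$, and we have

$$\mathbf{R}_{y\cdot x} = \begin{pmatrix} 1.000 & -0.164 \\ -0.164 & 1.000 \end{pmatrix}.$$

Thus, $r_{y2\cdot 1345} = -.164$, which can be compared with $r_{y2} = .221$.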


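To find $r_{y3\cdot 45}$, we have $\mathbf{y} = (y, x_1, x_2, x_3)'$ and $\mathbf{x} = (x_4, x_5)'$. Then, for example, we obtain

$$\mathbf{S}_{yy} = \begin{pmatrix} 90.290 & 1.535 & 4.880 & 106.202 \\ 1.535 & 0.691 & 1.494 & 3.255 \\ 4.880 & 1.494 & 5.401 & 10.155 \\ 106.202 & 3.255 & 10.155 & 200.668 \end{pmatrix}.$$

The diagonal matrix $\mathbf{D}_s$ is given by

$$\mathbf{D}_s = \mathrm{diag}(9.462,\ .827,\ 2.297,\ 11.219),$$

and we have

$$\mathbf{R}_{y\cdot x} = \begin{pmatrix} 1.000 & 0.198 & 0.210 & 0.954 \\ 0.198 & 1.000 & 0.792 & 0.304 \\ 0.210 & 0.792 & 1.000 & 0.324 \\ 0.954 & 0.304 & 0.324 & 1.000 \end{pmatrix}.$$

Thus, for example, $r_{y1\cdot 45} = .198$, $r_{y3\cdot 45} = .954$, $r_{12\cdot 45} = .792$, and $r_{23\cdot 45} = .324$. In this case, $\mathbf{R}_{y\cdot x}$ is little changed from $\mathbf{R}_{yy}$:

$$\mathbf{R}_{yy} = \begin{pmatrix} 1.000 & 0.194 & 0.221 & 0.789 \\ 0.194 & 1.000 & 0.774 & 0.277 \\ 0.221 & 0.774 & 1.000 & 0.308 \\ 0.789 & 0.277 & 0.308 & 1.000 \end{pmatrix}.$$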

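PROBLEMS

10.1 Show that $\mathbf{S}$ in (10.14) can be found as $\mathbf{S} = \sum_{i=1}^n (\mathbf{v}_i - \bar{\mathbf{v}})(\mathbf{v}_i - \bar{\mathbf{v}})'/(n - 1)$ as in (10.13).

10.2 Show that $\hat{\boldsymbol{\mu}}$ and $\mathbf{S}$ in (10.9) and (10.10) are jointly sufficient for $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$, as noted following Theorem 10.2c.

10.3 Show that $\mathbf{S} = \mathbf{D}\mathbf{R}\mathbf{D}$ gives the partitioned result in (10.19).

10.4 Show that $\mathrm{cov}(y, w) = \boldsymbol{\sigma}_{yx}'\boldsymbol{\Sigma}_{xx}^{-1}\boldsymbol{\sigma}_{yx}$ and $\mathrm{var}(w) = \boldsymbol{\sigma}_{yx}'\boldsymbol{\Sigma}_{xx}^{-1}\boldsymbol{\sigma}_{yx}$ as in (10.26), where $w = \mu_y + \boldsymbol{\sigma}_{yx}'\boldsymbol{\Sigma}_{xx}^{-1}(\mathbf{x} - \boldsymbol{\mu}_x)$.

10.5 Show that $\rho_{y|\mathbf{x}}^2$ in (10.27) is the maximum squared correlation between y and any linear function of $\mathbf{x}$, as in (10.28).

10.6 Show that $\rho_{y|\mathbf{x}}^2$ can be expressed as $\rho_{y|\mathbf{x}}^2 = 1 - |\boldsymbol{\Sigma}|/(\sigma_{yy}|\boldsymbol{\Sigma}_{xx}|)$ as in (10.29).

10.7 Show that $\rho_{y|\mathbf{x}}^2$ is invariant to linear transformations $u = ay$ and $\mathbf{v} = \mathbf{B}\mathbf{x}$, where $\mathbf{B}$ is nonsingular, as in (10.30).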


10.8 Show that $\mathrm{cov}(y - w, \mathbf{x}) = \mathbf{0}'$ as in (10.33).

10.9 Verify that $R^2 = r_{y\hat{y}}^2$ as in (10.36), using the following two definitions of $r_{y\hat{y}}^2$:
(a) $r_{y\hat{y}}^2 = \left[\sum_{i=1}^n (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})\right]^2 \Big/ \left[\sum_{i=1}^n (y_i - \bar{y})^2\sum_{i=1}^n (\hat{y}_i - \bar{\hat{y}})^2\right]$
(b) $r_{y\hat{y}}^2 = s_{y\hat{y}}^2/(s_y^2 s_{\hat{y}}^2)$

10.10 Show that $R^2 = \max_{\mathbf{a}} r_{y,\mathbf{a}'\mathbf{x}}^2$ as in (10.37).

10.11 Show that $R^2 = \mathbf{r}_{yx}'\mathbf{R}_{xx}^{-1}\mathbf{r}_{yx}$ as in (10.38).

10.12 Show that $R^2 = 1 - 1/r^{yy}$ as in (10.39), where $r^{yy}$ is the upper left-hand diagonal element of $\mathbf{R}^{-1}$, with $\mathbf{R}$ partitioned as in (10.18).

10.13 Verify that $R^2$ can be expressed in terms of determinants as in (10.40) and (10.41).

10.14 Show that $R^2$ is invariant to full-rank linear transformations on y or the x's, as in property 9 in Section 10.4.

10.15 Show that $\hat{\boldsymbol{\Sigma}}_0$ in (10.46) is the maximum likelihood estimator of $\boldsymbol{\Sigma}_0$ in (10.45) and that $\max_{H_0} L(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ is given by (10.47).

10.16 Show that LR in (10.48) is equal to $LR = (1 - R^2)^{n/2}$ in (10.49).

10.17 Obtain the confidence interval in (10.57) from the inequality in (10.56).

10.18 Suppose that we have three independent samples of bivariate normal data. The three sample correlations are $r_1$, $r_2$, and $r_3$ based, respectively, on sample sizes $n_1$, $n_2$, and $n_3$.
(a) Find the covariance matrix $\mathbf{V}$ of $\mathbf{z} = (z_1, z_2, z_3)'$, where $z_i = \frac{1}{2}\ln[(1 + r_i)/(1 - r_i)]$.
(b) Let $\boldsymbol{\mu}_z' = (\tanh^{-1}\rho_1, \tanh^{-1}\rho_2, \tanh^{-1}\rho_3)$, and let

$$\mathbf{C} = \begin{pmatrix} 1 & -1 & 0 \\ 0 & 1 & -1 \end{pmatrix}.$$

Find the distribution of $[\mathbf{C}(\mathbf{z} - \boldsymbol{\mu}_z)]'[\mathbf{C}\mathbf{V}\mathbf{C}']^{-1}[\mathbf{C}(\mathbf{z} - \boldsymbol{\mu}_z)]$.
(c) Using (b), propose a test for $H_0\colon \rho_1 = \rho_2 = \rho_3$ or equivalently $H_0\colon \mathbf{C}\boldsymbol{\mu}_z = \mathbf{0}$.

10.19 Prove Theorem 10.6.

10.20 Show that if z were orthogonal to the x's, (10.58) could be written in the form $R_{yw}^2 = R_{yx}^2 + r_{yz}^2$, as noted following Theorem 10.6.

10.21 Prove Theorem 10.7b.

10.22 Prove Theorem 10.7c.


10.23 Show that $\sum_{i=1}^{n} u_{1i}u_{2i} = \sum_{i=1}^{n} (y_{1i} - \bar{y}_1)(y_{2i} - \bar{y}_2) - \hat{\beta}_{11}\hat{\beta}_{12}\sum_{i=1}^{n} (y_{3i} - \bar{y}_3)^2$
as in (10.71).

10.24 Show that $\sum_{i=1}^{n} u_{1i}^2 = \sum_{i=1}^{n} (y_{1i} - \bar{y}_1)^2 - \hat{\beta}_{11}^2\sum_{i=1}^{n} (y_{3i} - \bar{y}_3)^2$ as in (10.74).

10.25 Obtain $r_{12\cdot 3}$ in (10.77) from $r_{u_1u_2}$ in (10.76).

10.26 Show that $\sum_{i=1}^{n} [\mathbf{y}_i - \hat{\mathbf{y}}_i(\mathbf{x})] = \mathbf{0}$, as noted following (10.78).

10.27 Show that $\mathbf{S}_{yy} = \sum_i (\mathbf{y}_i - \bar{\mathbf{y}})(\mathbf{y}_i - \bar{\mathbf{y}})'/(n - 1)$, $\mathbf{S}_{yx} = \sum_i (\mathbf{y}_i - \bar{\mathbf{y}})(\mathbf{x}_i - \bar{\mathbf{x}})'/(n - 1)$,
and $\mathbf{S}_{xx} = \sum_i (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})'/(n - 1)$, as noted following (10.79).

10.28 In an experiment with rats, the concentration of a particular drug in the liver
was of interest. For 19 rats the following variables were observed:

y = percentage of the dose in the liver
x1 = body weight
x2 = liver weight
x3 = relative dose

The data are given in Table 10.2 (Weisberg 1985, p. 122).

(a) Find $\mathbf{s}_{yx}$, $\mathbf{S}_{xx}$, $\hat{\boldsymbol{\beta}}_1$, $\hat{\beta}_0$, and $s^2$.
(b) Find $\mathbf{R}_{xx}$, $\mathbf{r}_{yx}$, and $\hat{\boldsymbol{\beta}}_1^*$.
(c) Find $R^2$.
(d) Test $H_0: \boldsymbol{\beta}_1 = \mathbf{0}$.

TABLE 10.2 Rat Data

 y    x1     x2     x3   |   y    x1     x2     x3
42   176    6.5   0.88   |  27   158    6.9   0.80
25   176    9.5   0.88   |  36   148    7.3   0.74
56   190    9.0   1.00   |  21   149    5.2   0.75
23   176    8.9   0.88   |  28   163    8.4   0.81
23   200    7.2   1.00   |  34   170    7.2   0.85
32   167    8.9   0.83   |  28   186    6.8   0.94
37   188    8.0   0.94   |  30   164    7.3   0.73
41   195   10.0   0.98   |  37   181    9.0   0.90
33   176    8.0   0.88   |  46   149    6.4   0.75
38   165    7.9   0.84   |

10.29 Use the hematology data in Table 10.1 as divided into two subsamples of
sizes 26 and 25 in Example 10.5b (the first 26 observations and the last 25
observations). For each pair of variables below, find $r_1$ and $r_2$ for the two
subsamples, find $z_1$ and $z_2$ as in (10.51), test $H_0: \rho_1 = \rho_2$ as in (10.55), and find
confidence limits for $\rho_1$ and $\rho_2$ as in (10.57).

(a) $y$ and $x_1$
(b) $y$ and $x_3$
(c) $y$ and $x_4$
(d) $y$ and $x_5$

10.30 For the rat data in Table 10.2, check the effect of each variable on $R^2$ as in
Section 10.6.

10.31 Using the rat data in Table 10.2:

(a) Find $r_{y1\cdot 23}$ and compare to $r_{y1}$.
(b) Find $r_{y2\cdot 13}$.
(c) Find $\mathbf{R}_{y\cdot x}$, where $\mathbf{y} = (y, x_1, x_2)'$ and $\mathbf{x} = x_3$, in order to obtain $r_{y1\cdot 3}$, $r_{y2\cdot 3}$,
and $r_{12\cdot 3}$.


11 Multiple Regression: Bayesian Inference


We now consider Bayesian estimation and prediction for the multiple linear 
regression model in which the x variables are fixed constants as in Chapters 7-9. 
The Bayesian statistical paradigm is conceptually simple and general because infer- 
ences involve only probability calculations as opposed to maximization of a function 
like the log likelihood. On the other hand, the probability calculations usually entail 
complicated or even intractable integrals. The Bayesian approach has become popular 
more recently because of the development of computer-intensive approximations to 
these integrals (Evans and Swartz 2000) and user-friendly programs to carry out 
the computations (Gilks et al. 1998). We discuss both analytical and computer- 
intensive approaches to the Bayesian multiple regression model. 

Throughout Chapters 7 and 8 we assumed that the parameters $\boldsymbol{\beta}$ and $\sigma^2$ were
unknown fixed constants. We couldn't really do otherwise because to this point (at
least implicitly) we have only allowed probability distributions to represent variability
due to such things as random sampling or imprecision of measurement instruments.
The Bayesian approach additionally allows probability distributions to represent
conjectural uncertainty. Thus $\boldsymbol{\beta}$ and $\sigma^2$ can be treated as if they are random variables
because we are uncertain about their values. The technical property that allows one
to treat parameters as random variables is exchangeability of the observational
units in the study (Lindley and Smith 1972).


11.1 ELEMENTS OF BAYESIAN STATISTICAL INFERENCE 


In Bayesian statistics, uncertainty about the value of a parameter is expressed using 
the tools of probability theory (e.g., a density function—see Section 3.2). Density 
functions of parameters like $\boldsymbol{\beta}$ and $\sigma^2$ reflect the current credibility of possible
values for these parameters. The goal of the Bayesian approach is to use data to 
update the uncertainty distributions for parameters, and then draw sensible 
conclusions using these updated distributions. 




The Bayesian approach can be used in any inference situation. However, it seems
especially natural in the following type of problem. Consider an industrial process in
which it is desired to estimate $\beta_0$ and $\beta_1$ for the straight-line relationship in (6.1)
between a response $y$ and a predictor $x$ for a particular batch of product. Suppose
that it is known from experience that $\beta_0$ and $\beta_1$ vary randomly from batch to
batch. Bayesian inference allows historical (or prior) knowledge of the distributions
of $\beta_0$ and $\beta_1$ among batches to be expressed in probabilistic form, and then to be
combined with $(x, y)$ data from a specific batch in order to give improved estimates of $\beta_0$
and $\beta_1$ for that specific batch.

Bayesian inference is based on two general equations. In these equations as
presented below, $\boldsymbol{\theta}$ is a vector of $m$ continuous parameters, $\mathbf{y}$ is a vector of $n$ continuous
observations, and $f$, $g$, $h$, $k$, $p$, $q$, $r$, and $t$ are probability density functions.

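We begin with the definition of the conditional density of $\boldsymbol{\theta}$ given $\mathbf{y}$ [see (3.18)]

$$g(\boldsymbol{\theta}|\mathbf{y}) = \frac{k(\mathbf{y}, \boldsymbol{\theta})}{h(\mathbf{y})}, \tag{11.1}$$

where $k(\mathbf{y}, \boldsymbol{\theta})$ is the joint density of $y_1, y_2, \ldots, y_n$ and $\theta_1, \theta_2, \ldots, \theta_m$. Using the
definition of the conditional density $f(\mathbf{y}|\boldsymbol{\theta})$, we can write $k(\mathbf{y}, \boldsymbol{\theta}) = f(\mathbf{y}|\boldsymbol{\theta})p(\boldsymbol{\theta})$, and
(11.1) becomes

$$g(\boldsymbol{\theta}|\mathbf{y}) = \frac{f(\mathbf{y}|\boldsymbol{\theta})p(\boldsymbol{\theta})}{h(\mathbf{y})}, \tag{11.2}$$

an expression that is commonly referred to as Bayes' theorem. By an extension of
(3.13), the marginal density $h(\mathbf{y})$ can be obtained by integrating $\boldsymbol{\theta}$ out of
$k(\mathbf{y}, \boldsymbol{\theta}) = f(\mathbf{y}|\boldsymbol{\theta})p(\boldsymbol{\theta})$ so that (11.2) becomes

$$g(\boldsymbol{\theta}|\mathbf{y}) = \frac{f(\mathbf{y}|\boldsymbol{\theta})p(\boldsymbol{\theta})}{\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} f(\mathbf{y}|\boldsymbol{\theta})p(\boldsymbol{\theta})\,d\boldsymbol{\theta}} = c\,f(\mathbf{y}|\boldsymbol{\theta})p(\boldsymbol{\theta}), \tag{11.3}$$

where $d\boldsymbol{\theta} = d\theta_1 \cdots d\theta_m$. In this expression, $p(\boldsymbol{\theta})$ is known as the prior density of $\boldsymbol{\theta}$,
and $g(\boldsymbol{\theta}|\mathbf{y})$ is called the posterior density of $\boldsymbol{\theta}$. The definite integral in the
denominator of (11.3) is often replaced by a constant ($c$) because after integration, it no
longer involves the random vector $\boldsymbol{\theta}$. This definite integral is often very complicated,
but can sometimes be obtained by noting that $c$ is a normalizing constant, that is, a
value such that the posterior density integrates to 1. Rearranging this expression and
reinterpreting the joint density function $f(\mathbf{y}|\boldsymbol{\theta})$ of the data as the likelihood function
$L(\boldsymbol{\theta}|\mathbf{y})$ (see Section 7.6.2), we obtain

$$g(\boldsymbol{\theta}|\mathbf{y}) = c\,p(\boldsymbol{\theta})L(\boldsymbol{\theta}|\mathbf{y}). \tag{11.4}$$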


Thus (11.2), the first general equation of Bayesian inference, merely states that the
posterior density of $\boldsymbol{\theta}$ given the data (representing the updated uncertainty in $\boldsymbol{\theta}$) is
proportional to the prior density of $\boldsymbol{\theta}$ times the likelihood function. Point and interval
estimates of the parameters are taken as mathematical features of this joint posterior
density or associated marginal posterior densities of individual parameters $\theta_i$. For
example, the mode or mean of the marginal posterior density of a parameter may
be used as a point estimate of the parameter. A central or highest density interval
(Gelman et al. 2004, pp. 38-39) over which the marginal posterior density of a
parameter integrates to $1 - \omega$ may be taken as a $100(1 - \omega)\%$ interval estimate of the
parameter.
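As a concrete illustration of (11.3) and (11.4), the following sketch approximates a posterior density on a grid. Everything specific in it is an assumption made for illustration: a single parameter $\theta$ (the mean of normal observations with known standard deviation), a normal prior, and five made-up data values. The normalizing constant $c$ is obtained numerically as the reciprocal of the integral in the denominator of (11.3).

```python
import math

def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

# Hypothetical data and prior (not from the text).
y = [4.1, 5.3, 4.8, 5.9, 5.0]
sigma = 1.0                        # known sampling standard deviation
prior_mean, prior_sd = 0.0, 10.0   # prior density p(theta)

step = 0.001
grid = [i * step for i in range(-20000, 20001)]   # theta from -20 to 20

# Unnormalized posterior: p(theta) * L(theta | y), as in (11.4).
unnorm = []
for theta in grid:
    like = 1.0
    for yi in y:
        like *= normal_pdf(yi, theta, sigma)
    unnorm.append(normal_pdf(theta, prior_mean, prior_sd) * like)

# c is the value that makes the posterior integrate to 1.
c = 1.0 / (sum(unnorm) * step)
posterior = [c * u for u in unnorm]

# Posterior mean as a point estimate of theta.
post_mean = sum(t * p for t, p in zip(grid, posterior)) * step
print(post_mean)
```

For this particular prior and likelihood the posterior is also available in closed form, so the grid result can be checked against the usual precision-weighted average of the prior mean and the data.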

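For the second general equation of Bayesian inference, we consider a future
observation $y_0$. In the Bayesian approach, $y_0$ is not independent of $\mathbf{y}$ as was assumed in
Section 8.6.5 because its density depends on $\boldsymbol{\theta}$, a random vector whose current
uncertainty depends on $\mathbf{y}$. Since $y_0$, $\mathbf{y}$, and $\boldsymbol{\theta}$ are jointly distributed, the posterior predictive
density of $y_0$ given $\mathbf{y}$ is obtained by integrating $\boldsymbol{\theta}$ out of the joint conditional density
of $y_0$ and $\boldsymbol{\theta}$ given $\mathbf{y}$:

$$r(y_0|\mathbf{y}) = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} t(y_0, \boldsymbol{\theta}|\mathbf{y})\,d\boldsymbol{\theta}
= \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} q(y_0|\boldsymbol{\theta}, \mathbf{y})\,g(\boldsymbol{\theta}|\mathbf{y})\,d\boldsymbol{\theta} \quad [\text{by (4.28)}],$$

where $q(y_0|\boldsymbol{\theta}, \mathbf{y})$ is the conditional density function of the sampling distribution for
a future observation $y_0$. Since $y_0$ is dependent on $\mathbf{y}$ only through $\boldsymbol{\theta}$, $q(y_0|\boldsymbol{\theta}, \mathbf{y})$
simplifies, and we have

$$r(y_0|\mathbf{y}) = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} q(y_0|\boldsymbol{\theta})\,g(\boldsymbol{\theta}|\mathbf{y})\,d\boldsymbol{\theta}. \tag{11.5}$$

Equation (11.5) expresses the intuitive idea that uncertainty associated with the
predicted value of a future observation has two components: sampling variability and
uncertainty in the parameters. As before, point and interval predictions can be
taken as mathematical features (such as the mean, mode, or specified integral) of
this posterior predictive density.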
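The two components of uncertainty in (11.5) can be seen in a small simulation. The sketch below is illustrative only: it assumes a scalar parameter whose posterior is normal with a made-up mean and standard deviation, and a normal sampling distribution with known standard deviation. Each predictive draw is produced by first drawing $\theta$ from $g(\theta|\mathbf{y})$ and then drawing $y_0$ from $q(y_0|\theta)$.

```python
import random
import statistics

random.seed(1)

# Assumed (hypothetical) posterior for theta and sampling standard deviation.
post_mean, post_sd = 5.0, 0.45
sigma = 1.0

draws = []
for _ in range(100_000):
    theta = random.gauss(post_mean, post_sd)   # draw from g(theta | y)
    draws.append(random.gauss(theta, sigma))   # draw from q(y0 | theta)

pred_mean = statistics.fmean(draws)
pred_var = statistics.variance(draws)

# The predictive variance is close to sampling variance + parameter variance.
print(pred_mean, pred_var, sigma ** 2 + post_sd ** 2)
```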


11.2 A BAYESIAN MULTIPLE LINEAR REGRESSION MODEL 


Bayesian multiple regression models are similar to the classical multiple regression
model (see Section 7.6.1) except that they include specifications of the prior
distributions for the parameters. Prior specification is an important part of the art and
practice of Bayesian modeling, but since the focus of this text is the basic theory of
linear models, we discuss only one set of prior specifications, one that is chosen for
its mathematical convenience rather than actual prior information.


11.2.1 A Bayesian Multiple Regression Model with a Conjugate Prior 


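Although not necessary, it is often convenient to parameterize Bayesian models using
precision ($\tau$) rather than variance ($\sigma^2$), where

$$\tau = \frac{1}{\sigma^2}.$$

Using this parameterization, as an example of a Bayesian linear regression model, let

$$\mathbf{y}\,|\,\boldsymbol{\beta}, \tau \ \text{be} \ N_n\!\left(\mathbf{X}\boldsymbol{\beta}, \frac{1}{\tau}\mathbf{I}\right),$$

$$\boldsymbol{\beta}\,|\,\tau \ \text{be} \ N_{k+1}\!\left(\boldsymbol{\phi}, \frac{1}{\tau}\mathbf{V}\right),$$

$$\tau \ \text{be} \ \text{gamma}(\alpha, \delta).$$

The second and third distributions here are prior distributions, and we assume that $\boldsymbol{\phi}$,
$\mathbf{V}$, $\alpha$, and $\delta$ (the parameters of the prior distributions) are known. Although we will
not do so here, this model could be extended by specifying hyperprior distributions
for $\boldsymbol{\phi}$, $\mathbf{V}$, $\alpha$, and $\delta$ (Lindley and Smith 1972).

As in previous chapters, the number of predictor variables is denoted by $k$ (so that
the rank of $\mathbf{X}$ is $k + 1$) and the number of observations by $n$. The prior density
function for $\boldsymbol{\beta}|\tau$ is, using (4.9),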


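$$p_1(\boldsymbol{\beta}|\tau) = \frac{1}{(2\pi)^{(k+1)/2}\,|\tau^{-1}\mathbf{V}|^{1/2}}\; e^{-\tau(\boldsymbol{\beta}-\boldsymbol{\phi})'\mathbf{V}^{-1}(\boldsymbol{\beta}-\boldsymbol{\phi})/2}. \tag{11.6}$$

The prior density function for $\tau$ is the gamma density (Gelman et al. 2004,
pp. 574-575)

$$p_2(\tau) = \frac{\delta^{\alpha}}{\Gamma(\alpha)}\,\tau^{\alpha-1}e^{-\delta\tau}, \tag{11.7}$$

where $\alpha > 0$, $\delta > 0$, and by definition

$$\Gamma(\alpha) = \int_0^{\infty} x^{\alpha-1}e^{-x}\,dx \tag{11.8}$$

(see any advanced calculus text). For the gamma density in (11.7),

$$E(\tau) = \frac{\alpha}{\delta} \quad \text{and} \quad \mathrm{var}(\tau) = \frac{\alpha}{\delta^2}.$$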
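When simulating from (11.7), it helps to note that many software libraries parameterize the gamma distribution by shape and scale rather than by shape $\alpha$ and rate $\delta$. The sketch below (an illustration with arbitrarily chosen values $\alpha = 3$ and $\delta = 2$, not values from the text) uses Python's standard-library `random.gammavariate`, whose second argument is a scale, so $1/\delta$ is passed; the sample moments can then be compared with $\alpha/\delta$ and $\alpha/\delta^2$.

```python
import random
import statistics

random.seed(0)

alpha, delta = 3.0, 2.0   # hypothetical shape and rate

# gammavariate(shape, scale): the scale is the reciprocal of the rate delta.
tau = [random.gammavariate(alpha, 1.0 / delta) for _ in range(200_000)]

mean_tau = statistics.fmean(tau)
var_tau = statistics.variance(tau)

print(mean_tau, alpha / delta)        # sample mean vs. alpha / delta
print(var_tau, alpha / delta ** 2)    # sample variance vs. alpha / delta^2
```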


These prior distributions could be formulated with small enough variances that the
prior knowledge strongly influences posterior distributions of the parameters in
the model. If so, they are called informative priors. On the other hand, both of
these priors could be formulated with large variances so that they have very little
effect on the posterior distributions. If so, they are called diffuse priors. The priors
would be diffuse if, for example, $\mathbf{V}$ in (11.6) were a diagonal matrix with very
large diagonal elements, and if $\delta$ in (11.7) were very close to zero.

The prior specifications in (11.6) and (11.7) are flexible and reasonable, and they
also have nice mathematical properties, as will be shown in Theorem 11.2a. Other
specifications for the prior distributions could be used. However, even the minor
modification of proposing a prior distribution for $\boldsymbol{\beta}$ that is not conditional on $\tau$
makes the model far less mathematically tractable.

The joint prior for $\boldsymbol{\beta}$ and $\tau$ in our model is called a conjugate prior because its use
results in a posterior distribution of the same form as the prior. We prove this in the
following theorem.


Theorem 11.2a. Consider the Bayesian multiple regression model in which $\mathbf{y}|\boldsymbol{\beta}, \tau$
is $N_n(\mathbf{X}\boldsymbol{\beta}, \tau^{-1}\mathbf{I})$, $\boldsymbol{\beta}|\tau$ is $N_{k+1}(\boldsymbol{\phi}, \tau^{-1}\mathbf{V})$, and $\tau$ is gamma($\alpha$, $\delta$). The joint prior
distribution is conjugate, that is, $g(\boldsymbol{\beta}, \tau|\mathbf{y})$ is of the same form as $p(\boldsymbol{\beta}, \tau)$.


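Proof. Combining (11.6) and (11.7), the joint prior density is

$$\begin{aligned}
p(\boldsymbol{\beta}, \tau) &= p_1(\boldsymbol{\beta}|\tau)\,p_2(\tau) \\
&= c_1\,\tau^{(k+1)/2}e^{-\tau(\boldsymbol{\beta}-\boldsymbol{\phi})'\mathbf{V}^{-1}(\boldsymbol{\beta}-\boldsymbol{\phi})/2}\,\tau^{\alpha-1}e^{-\delta\tau} \\
&= c_1\,\tau^{(\alpha_*+k+1)/2}e^{-\tau[(\boldsymbol{\beta}-\boldsymbol{\phi})'\mathbf{V}^{-1}(\boldsymbol{\beta}-\boldsymbol{\phi})+\delta_*]/2},
\end{aligned} \tag{11.9}$$

where $\alpha_* = 2\alpha - 2$, $\delta_* = 2\delta$, and all the factors not involving random variables are
collected into the normalizing constant $c_1$. Using (11.4), the joint posterior density
is then

$$\begin{aligned}
g(\boldsymbol{\beta}, \tau|\mathbf{y}) &= c\,p(\boldsymbol{\beta}, \tau)\,L(\boldsymbol{\beta}, \tau|\mathbf{y}) \\
&= c_2\,\tau^{(\alpha_*+k+1)/2}e^{-\tau[(\boldsymbol{\beta}-\boldsymbol{\phi})'\mathbf{V}^{-1}(\boldsymbol{\beta}-\boldsymbol{\phi})+\delta_*]/2}\,\tau^{n/2}e^{-\tau(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})'(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})/2} \\
&= c_2\,\tau^{(\alpha_{**}+k+1)/2}e^{-\tau[(\boldsymbol{\beta}-\boldsymbol{\phi})'\mathbf{V}^{-1}(\boldsymbol{\beta}-\boldsymbol{\phi})+(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})'(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})+\delta_*]/2},
\end{aligned}$$

where $\alpha_{**} = 2\alpha - 2 + n$, and all the factors not involving random variables are
collected into the normalizing constant $c_2$. By expanding and completing the square in
the exponent (Problem 11.1), we obtain

$$g(\boldsymbol{\beta}, \tau|\mathbf{y}) = c_2\,\tau^{(\alpha_{**}+k+1)/2}e^{-\tau[(\boldsymbol{\beta}-\boldsymbol{\phi}_*)'\mathbf{V}_*^{-1}(\boldsymbol{\beta}-\boldsymbol{\phi}_*)+\delta_{**}]/2}, \tag{11.10}$$

where $\mathbf{V}_* = (\mathbf{V}^{-1} + \mathbf{X}'\mathbf{X})^{-1}$, $\boldsymbol{\phi}_* = \mathbf{V}_*(\mathbf{V}^{-1}\boldsymbol{\phi} + \mathbf{X}'\mathbf{y})$, and
$\delta_{**} = -\boldsymbol{\phi}_*'\mathbf{V}_*^{-1}\boldsymbol{\phi}_* + \boldsymbol{\phi}'\mathbf{V}^{-1}\boldsymbol{\phi} + \mathbf{y}'\mathbf{y} + \delta_*$. Hence the joint posterior density has exactly the same form
as the joint prior density in (11.9).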
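The updating formulas in (11.10) are straightforward to compute. The sketch below is a minimal illustration assuming NumPy and synthetic data; the function name `posterior_parameters` and all numerical values are hypothetical. It computes $\mathbf{V}_*$, $\boldsymbol{\phi}_*$, $\alpha_{**}$, and $\delta_{**}$, and then checks the completing-the-square step numerically at an arbitrary value of $\boldsymbol{\beta}$.

```python
import numpy as np

def posterior_parameters(X, y, phi, V, alpha, delta):
    """Return (V_star, phi_star, alpha_2star, delta_2star) as in (11.10)."""
    V_inv = np.linalg.inv(V)
    V_star_inv = V_inv + X.T @ X
    V_star = np.linalg.inv(V_star_inv)
    phi_star = V_star @ (V_inv @ phi + X.T @ y)
    delta_2star = (-phi_star @ V_star_inv @ phi_star
                   + phi @ V_inv @ phi + y @ y + 2.0 * delta)
    alpha_2star = 2.0 * alpha - 2.0 + len(y)
    return V_star, phi_star, alpha_2star, delta_2star

# Synthetic data and prior parameters (hypothetical).
rng = np.random.default_rng(0)
n, k = 25, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)
phi = np.array([0.5, 0.0, 0.0])
V = np.diag([4.0, 2.0, 2.0])
alpha, delta = 2.0, 1.5

V_star, phi_star, alpha_2star, delta_2star = posterior_parameters(
    X, y, phi, V, alpha, delta)

# Exponent (without the factor -tau/2) before and after completing the square.
beta = rng.normal(size=k + 1)
before = ((beta - phi) @ np.linalg.solve(V, beta - phi)
          + (y - X @ beta) @ (y - X @ beta) + 2.0 * delta)
after = ((beta - phi_star) @ np.linalg.solve(V_star, beta - phi_star)
         + delta_2star)
print(before, after)
```

The two printed values should agree up to rounding error for any choice of `beta`.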


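It might seem odd to include terms like $\mathbf{X}'\mathbf{y}$ and $\mathbf{y}'\mathbf{y}$ in the "constants" of a
probability distribution, while considering parameters like $\boldsymbol{\beta}$ and $\tau$ to be random, but this
is completely characteristic of Bayesian inference. In this sense, inference in a
Bayesian linear model is opposite to inference in the classical linear model.


11.2.2 Marginal Posterior Density of $\boldsymbol{\beta}$


In order to carry out inferences for $\boldsymbol{\beta}$, the marginal posterior density of $\boldsymbol{\beta}$ [see (3.13)]
must be obtained by integrating $\tau$ out of the joint posterior density in (11.10). The
following theorem gives the form of this marginal distribution.


Theorem 11.2b. Consider the Bayesian multiple regression model in which $\mathbf{y}|\boldsymbol{\beta}, \tau$
is $N_n(\mathbf{X}\boldsymbol{\beta}, \tau^{-1}\mathbf{I})$, $\boldsymbol{\beta}|\tau$ is $N_{k+1}(\boldsymbol{\phi}, \tau^{-1}\mathbf{V})$, and $\tau$ is gamma($\alpha$, $\delta$). The marginal posterior
distribution $u(\boldsymbol{\beta}|\mathbf{y})$ is a multivariate $t$ distribution with parameters $(n + 2\alpha, \boldsymbol{\phi}_*, \mathbf{W}_*)$,
where

$$\boldsymbol{\phi}_* = (\mathbf{V}^{-1} + \mathbf{X}'\mathbf{X})^{-1}(\mathbf{V}^{-1}\boldsymbol{\phi} + \mathbf{X}'\mathbf{y}) \tag{11.11}$$

and

$$\mathbf{W}_* = \left[\frac{(\mathbf{y} - \mathbf{X}\boldsymbol{\phi})'(\mathbf{I} + \mathbf{X}\mathbf{V}\mathbf{X}')^{-1}(\mathbf{y} - \mathbf{X}\boldsymbol{\phi}) + 2\delta}{n + 2\alpha}\right](\mathbf{V}^{-1} + \mathbf{X}'\mathbf{X})^{-1}. \tag{11.12}$$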


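Proof. The marginal distribution of $\boldsymbol{\beta}|\mathbf{y}$ is obtained by integration as

$$u(\boldsymbol{\beta}|\mathbf{y}) = \int_0^{\infty} g(\boldsymbol{\beta}, \tau|\mathbf{y})\,d\tau.$$

By (11.10), this becomes

$$u(\boldsymbol{\beta}|\mathbf{y}) = c_2\int_0^{\infty} \tau^{(\alpha_{**}+k+1)/2}e^{-\tau[(\boldsymbol{\beta}-\boldsymbol{\phi}_*)'\mathbf{V}_*^{-1}(\boldsymbol{\beta}-\boldsymbol{\phi}_*)+\delta_{**}]/2}\,d\tau.$$

Using (11.8) together with integration by substitution, the integral in this expression
can be solved (Problem 11.2) to give the posterior distribution of $\boldsymbol{\beta}|\mathbf{y}$ as

$$\begin{aligned}
u(\boldsymbol{\beta}|\mathbf{y}) &= c_2\,\Gamma\!\left(\frac{\alpha_{**}+2+k+1}{2}\right)\left[\frac{(\boldsymbol{\beta}-\boldsymbol{\phi}_*)'\mathbf{V}_*^{-1}(\boldsymbol{\beta}-\boldsymbol{\phi}_*)+\delta_{**}}{2}\right]^{-(\alpha_{**}+2+k+1)/2} \\
&= c_3\left[(\boldsymbol{\beta}-\boldsymbol{\phi}_*)'\mathbf{V}_*^{-1}(\boldsymbol{\beta}-\boldsymbol{\phi}_*) - \boldsymbol{\phi}_*'\mathbf{V}_*^{-1}\boldsymbol{\phi}_* + \boldsymbol{\phi}'\mathbf{V}^{-1}\boldsymbol{\phi} + \mathbf{y}'\mathbf{y} + \delta_*\right]^{-(\alpha_{**}+2+k+1)/2}.
\end{aligned}$$

To show that this is the multivariate $t$ density, several algebraic steps are required as
outlined in Problems 11.3a-c and 11.4. See also Seber and Lee (2003, pp. 100-110).
After these steps, the preceding expression becomes

$$u(\boldsymbol{\beta}|\mathbf{y}) = c_3\left[(\boldsymbol{\beta}-\boldsymbol{\phi}_*)'\mathbf{V}_*^{-1}(\boldsymbol{\beta}-\boldsymbol{\phi}_*) + (\mathbf{y}-\mathbf{X}\boldsymbol{\phi})'(\mathbf{I}+\mathbf{X}\mathbf{V}\mathbf{X}')^{-1}(\mathbf{y}-\mathbf{X}\boldsymbol{\phi}) + 2\delta\right]^{-(\alpha_{**}+2+k+1)/2}.$$

Dividing the expression in square brackets by $(\mathbf{y}-\mathbf{X}\boldsymbol{\phi})'(\mathbf{I}+\mathbf{X}\mathbf{V}\mathbf{X}')^{-1}(\mathbf{y}-\mathbf{X}\boldsymbol{\phi}) + 2\delta$,
modifying the normalizing constant accordingly, and replacing $\alpha_{**}$ by $2\alpha - 2 + n$, we
obtain

$$\begin{aligned}
u(\boldsymbol{\beta}|\mathbf{y}) &= c_4\left[1 + \frac{(\boldsymbol{\beta}-\boldsymbol{\phi}_*)'\mathbf{V}_*^{-1}(\boldsymbol{\beta}-\boldsymbol{\phi}_*)/(n+2\alpha)}{[(\mathbf{y}-\mathbf{X}\boldsymbol{\phi})'(\mathbf{I}+\mathbf{X}\mathbf{V}\mathbf{X}')^{-1}(\mathbf{y}-\mathbf{X}\boldsymbol{\phi}) + 2\delta]/(n+2\alpha)}\right]^{-(n+2\alpha+k+1)/2} \\
&= c_4\left[1 + \frac{(\boldsymbol{\beta}-\boldsymbol{\phi}_*)'\mathbf{W}_*^{-1}(\boldsymbol{\beta}-\boldsymbol{\phi}_*)}{n+2\alpha}\right]^{-(n+2\alpha+k+1)/2},
\end{aligned} \tag{11.13}$$

where $\mathbf{W}_*$ is as given in (11.12). The expression in (11.13) can now be recognized as
the density function of the multivariate $t$ distribution (Gelman et al. 2004, pp. 576-577;
Rencher 1998, p. 56) with parameters $(n + 2\alpha, \boldsymbol{\phi}_*, \mathbf{W}_*)$. Note that $\boldsymbol{\phi}_*$ is the
mean vector and $[(n + 2\alpha)/(n + 2\alpha - 2)]\mathbf{W}_*$ is the covariance matrix of $\boldsymbol{\beta}|\mathbf{y}$.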
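The algebraic step between the last two bracketed expressions in the proof rests on the identity $(\mathbf{y}-\mathbf{X}\boldsymbol{\phi})'(\mathbf{I}+\mathbf{X}\mathbf{V}\mathbf{X}')^{-1}(\mathbf{y}-\mathbf{X}\boldsymbol{\phi}) = -\boldsymbol{\phi}_*'\mathbf{V}_*^{-1}\boldsymbol{\phi}_* + \boldsymbol{\phi}'\mathbf{V}^{-1}\boldsymbol{\phi} + \mathbf{y}'\mathbf{y}$. A numerical check is not a proof, but it can be reassuring. The sketch below assumes NumPy, synthetic data, and arbitrary prior values (all hypothetical); it verifies the identity and then forms $\mathbf{W}_*$ in (11.12) and the posterior covariance matrix of $\boldsymbol{\beta}|\mathbf{y}$.

```python
import numpy as np

# Synthetic data and prior parameters (hypothetical).
rng = np.random.default_rng(1)
n, k = 20, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)
phi = np.array([1.0, 0.0, 0.0])
V = np.diag([3.0, 1.0, 1.0])
alpha, delta = 2.0, 1.0

V_inv = np.linalg.inv(V)
V_star = np.linalg.inv(V_inv + X.T @ X)
phi_star = V_star @ (V_inv @ phi + X.T @ y)

# Left side: quadratic form in (y - X phi) with (I + X V X')^{-1}.
r = y - X @ phi
quad_marginal = r @ np.linalg.solve(np.eye(n) + X @ V @ X.T, r)

# Right side: the terms left over after completing the square.
quad_completed = (-phi_star @ np.linalg.solve(V_star, phi_star)
                  + phi @ V_inv @ phi + y @ y)

df = n + 2.0 * alpha                               # first parameter of the t
W_star = (quad_marginal + 2.0 * delta) / df * V_star   # (11.12)
post_cov = df / (df - 2.0) * W_star                # covariance of beta | y

print(quad_marginal, quad_completed)
```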


As a historical note, the reasoning in this section is closely related to the work of 
W. S. Gosset or “Student” (Pearson et al. 1990, pp. 49-53, 72—73) on the small- 
sample distribution of 


un |<! 


Gosset used Bayesian reasoning (“inverse probability”) with a uniform prior distri- 
bution (“equal distribution of ignorance”) to show through a combination of proof, 
conjecture, and simulation that the posterior density of t is related to what we now 
call Student’s ¢ distribution with n — 1 degrees of freedom. 


284 MULTIPLE REGRESSION: BAYESIAN INFERENCE 


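11.2.3 Marginal Posterior Densities of $\tau$ and $\sigma^2$


Inferences regarding $\tau$ and $\sigma^2$ require knowledge of the marginal posterior
distribution of $\tau|\mathbf{y}$. We derive the posterior density of $\tau|\mathbf{y}$ in the following theorem.


Theorem 11.2c. Consider the Bayesian multiple regression model in which $\mathbf{y}|\boldsymbol{\beta}, \tau$
is $N_n(\mathbf{X}\boldsymbol{\beta}, \tau^{-1}\mathbf{I})$, $\boldsymbol{\beta}|\tau$ is $N_{k+1}(\boldsymbol{\phi}, \tau^{-1}\mathbf{V})$, and $\tau$ is gamma($\alpha$, $\delta$). The marginal
posterior distribution $v(\tau|\mathbf{y})$ is a gamma distribution with parameters $\alpha + n/2$
and $(-\boldsymbol{\phi}_*'\mathbf{V}_*^{-1}\boldsymbol{\phi}_* + \boldsymbol{\phi}'\mathbf{V}^{-1}\boldsymbol{\phi} + \mathbf{y}'\mathbf{y} + 2\delta)/2$, where $\mathbf{V}_* = (\mathbf{V}^{-1} + \mathbf{X}'\mathbf{X})^{-1}$ and
$\boldsymbol{\phi}_* = \mathbf{V}_*(\mathbf{V}^{-1}\boldsymbol{\phi} + \mathbf{X}'\mathbf{y})$.


Proof. The marginal distribution of $\tau|\mathbf{y}$ is obtained by integration as

$$\begin{aligned}
v(\tau|\mathbf{y}) &= \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} g(\boldsymbol{\beta}, \tau|\mathbf{y})\,d\boldsymbol{\beta} \\
&= c_2\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} \tau^{(\alpha_{**}+k+1)/2}e^{-\tau[(\boldsymbol{\beta}-\boldsymbol{\phi}_*)'\mathbf{V}_*^{-1}(\boldsymbol{\beta}-\boldsymbol{\phi}_*)+\delta_{**}]/2}\,d\boldsymbol{\beta} \\
&= c_2\,\tau^{(\alpha_{**}+k+1)/2}e^{-\delta_{**}\tau/2}\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} e^{-\tau(\boldsymbol{\beta}-\boldsymbol{\phi}_*)'\mathbf{V}_*^{-1}(\boldsymbol{\beta}-\boldsymbol{\phi}_*)/2}\,d\boldsymbol{\beta},
\end{aligned}$$

where all the factors not involving random variables are collected into the
normalizing constant $c_2$ as in (11.10). Since the integral in the preceding expression is
proportional to the integral of a joint multivariate normal density, we obtain

$$\begin{aligned}
v(\tau|\mathbf{y}) &= c_2\,\tau^{(\alpha_{**}+k+1)/2}e^{-\delta_{**}\tau/2}(2\pi)^{(k+1)/2}|\mathbf{V}_*|^{1/2}\tau^{-(k+1)/2} \\
&= c_5\,\tau^{(\alpha_{**}+k+1)/2-(k+1)/2}e^{-\delta_{**}\tau/2} \\
&= c_5\,\tau^{\alpha+n/2-1}e^{-[(-\boldsymbol{\phi}_*'\mathbf{V}_*^{-1}\boldsymbol{\phi}_* + \boldsymbol{\phi}'\mathbf{V}^{-1}\boldsymbol{\phi} + \mathbf{y}'\mathbf{y} + 2\delta)/2]\tau},
\end{aligned} \tag{11.14}$$

which is the density function of the specified gamma distribution.


The marginal posterior density of $\sigma^2$ can now be obtained by the univariate
change-of-variable technique (4.2) as

$$w(\sigma^2|\mathbf{y}) = c_6\,(\sigma^2)^{-(\alpha+n/2+1)}e^{-[(-\boldsymbol{\phi}_*'\mathbf{V}_*^{-1}\boldsymbol{\phi}_* + \boldsymbol{\phi}'\mathbf{V}^{-1}\boldsymbol{\phi} + \mathbf{y}'\mathbf{y} + 2\delta)/2]/\sigma^2}, \tag{11.15}$$

which is the density function of the inverse gamma distribution with parameters
$\alpha + n/2$ and $(-\boldsymbol{\phi}_*'\mathbf{V}_*^{-1}\boldsymbol{\phi}_* + \boldsymbol{\phi}'\mathbf{V}^{-1}\boldsymbol{\phi} + \mathbf{y}'\mathbf{y} + 2\delta)/2$ (Gelman et al. 2004,
pp. 574-575).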


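11.3 INFERENCE IN BAYESIAN MULTIPLE LINEAR REGRESSION


11.3.1 Bayesian Point and Interval Estimates of Regression Coefficients


A sensible Bayesian point estimator of $\boldsymbol{\beta}$ is the mean of the marginal posterior density
in (11.13)

$$\boldsymbol{\phi}_* = (\mathbf{V}^{-1} + \mathbf{X}'\mathbf{X})^{-1}(\mathbf{V}^{-1}\boldsymbol{\phi} + \mathbf{X}'\mathbf{y}), \tag{11.16}$$

and a sensible $100(1 - \omega)\%$ Bayesian confidence region for $\boldsymbol{\beta}$ is the highest-density
region $\Omega$ such that

$$c_4\int\cdots\int_{\Omega}\left[1 + \frac{(\boldsymbol{\beta}-\boldsymbol{\phi}_*)'\mathbf{W}_*^{-1}(\boldsymbol{\beta}-\boldsymbol{\phi}_*)}{n+2\alpha}\right]^{-(n+2\alpha+k+1)/2}d\boldsymbol{\beta} = 1 - \omega. \tag{11.17}$$

A convenient property of the multivariate $t$ distribution is that linear functions of
the random vector follow the (univariate) $t$ distribution. Thus, given $\mathbf{y}$,

$$\frac{\mathbf{a}'\boldsymbol{\beta} - \mathbf{a}'\boldsymbol{\phi}_*}{\sqrt{\mathbf{a}'\mathbf{W}_*\mathbf{a}}} \ \text{is} \ t(n + 2\alpha)$$

and, as an important special case,

$$\frac{\beta_i - \phi_{*i}}{\sqrt{w_{*ii}}} \ \text{is} \ t(n + 2\alpha), \tag{11.18}$$

where $\phi_{*i}$ is the $i$th element of $\boldsymbol{\phi}_*$ and $w_{*ii}$ is the $i$th diagonal element of $\mathbf{W}_*$. Thus a
Bayesian point estimate of $\beta_i$ is $\phi_{*i}$ and a $100(1 - \omega)\%$ Bayesian confidence interval
for $\beta_i$ is

$$\phi_{*i} \pm t_{\omega/2,\,n+2\alpha}\sqrt{w_{*ii}}. \tag{11.19}$$

One very appealing aspect of Bayesian inference is that intervals like (11.19) have
a natural interpretation. Instead of the careful classical interpretation of a confidence
interval in terms of hypothetical repeated sampling, one can simply and correctly say
that the probability is $1 - \omega$ that $\beta_i$ is in (11.19).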

An interesting final note on Bayesian estimation of $\boldsymbol{\beta}$ is that the Bayesian estimator
$\boldsymbol{\phi}_*$ in (11.16) can be obtained as the generalized least-squares estimator of $\boldsymbol{\beta}$ in
(7.63). To see this, consider adding the prior information to the data as if it constituted
a set of additional observations. The idea is to augment $\mathbf{y}$ with $\boldsymbol{\phi}$, and to consider the
mean vector and covariance matrix of the augmented vector $\begin{pmatrix}\mathbf{y} \\ \boldsymbol{\phi}\end{pmatrix}$ to be, respectively,

$$\begin{pmatrix}\mathbf{X} \\ \mathbf{I}_{k+1}\end{pmatrix}\boldsymbol{\beta} \quad \text{and} \quad \frac{1}{\tau}\begin{pmatrix}\mathbf{I}_n & \mathbf{O} \\ \mathbf{O} & \mathbf{V}\end{pmatrix}.$$

Generalized least squares estimation expressed in terms of these partitioned
matrices then gives $\boldsymbol{\phi}_*$ in (11.16) as an estimate of $\boldsymbol{\beta}$ (Problem 11.6). The implication
of this is that prior information on the regression coefficients can be incorporated into
a multiple linear regression model by the intuitive informal process of "adding"
observations.
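The augmentation argument can be checked numerically. In the sketch below (NumPy assumed; the data and prior values are synthetic and hypothetical), the prior mean $\boldsymbol{\phi}$ is appended to $\mathbf{y}$ as $k + 1$ extra "observations" with design rows $\mathbf{I}_{k+1}$ and covariance block $\mathbf{V}$. The factor $1/\tau$ cancels in the generalized least-squares estimator, so it is omitted.

```python
import numpy as np

# Synthetic data and prior parameters (hypothetical).
rng = np.random.default_rng(2)
n, k = 15, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=n)
phi = np.array([0.0, 1.0, 1.0])
V = np.diag([5.0, 0.5, 0.5])

# Augmented response, design matrix, and covariance (up to the factor 1/tau).
y_aug = np.concatenate([y, phi])
X_aug = np.vstack([X, np.eye(k + 1)])
Sigma = np.block([[np.eye(n), np.zeros((n, k + 1))],
                  [np.zeros((k + 1, n)), V]])

# Generalized least squares on the augmented data.
Sigma_inv = np.linalg.inv(Sigma)
beta_gls = np.linalg.solve(X_aug.T @ Sigma_inv @ X_aug,
                           X_aug.T @ Sigma_inv @ y_aug)

# Bayesian estimator phi_star from (11.16).
V_inv = np.linalg.inv(V)
phi_star = np.linalg.solve(V_inv + X.T @ X, V_inv @ phi + X.T @ y)

print(beta_gls, phi_star)
```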


11.3.2 Hypothesis Tests for Regression Coefficients in 
Bayesian Inference 


Classical hypothesis testing is not a natural part of Bayesian inference (Gelman et al.
2004, p. 162). Nonetheless, if the question addressed by a classical hypothesis test is
whether the data support the conclusion (i.e., alternative hypothesis) that $\beta_i$ is greater
than $\beta_{i0}$, a sensible approach is to use the posterior distribution (in this case the
$t$ distribution with $n + 2\alpha$ degrees of freedom) to compute the probability

$$P\!\left(t(n + 2\alpha) > \frac{\beta_{i0} - \phi_{*i}}{\sqrt{w_{*ii}}}\right).$$

The larger this probability is, the more credible is the hypothesis that $\beta_i > \beta_{i0}$.
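In practice this tail probability would be read from a statistical library's $t$ distribution function. The self-contained sketch below instead integrates the Student $t$ density numerically with the trapezoid rule, using only the Python standard library. The function names `t_pdf`, `t_upper_tail`, and `prob_beta_exceeds`, and the numbers in the usage lines, are illustrative assumptions.

```python
import math

def t_pdf(x, df):
    """Density of Student's t with df degrees of freedom (df need not be an integer)."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_upper_tail(c, df, steps=20000):
    """P(t_df > c), by the trapezoid rule on [0, |c|] plus symmetry about 0."""
    a = abs(c)
    area = 0.0
    if a > 0:
        h = a / steps
        inner = sum(t_pdf(i * h, df) for i in range(1, steps))
        area = h * (0.5 * t_pdf(0.0, df) + inner + 0.5 * t_pdf(a, df))
    return 0.5 - area if c >= 0 else 0.5 + area

def prob_beta_exceeds(beta_i0, phi_star_i, w_star_ii, df):
    """Posterior probability that beta_i > beta_i0, with df = n + 2*alpha."""
    return t_upper_tail((beta_i0 - phi_star_i) / math.sqrt(w_star_ii), df)

# Hypothetical posterior summaries: phi_*i = 0.8, w_*ii = 0.09, n + 2*alpha = 20.
print(prob_beta_exceeds(0.0, 0.8, 0.09, 20.0))
```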

If, alternatively, classical hypothesis testing is used to select a model from a set of 
candidate models, the corresponding Bayesian approach is to compute an information 
statistic for each model in question. For example, Schwarz (1978) proposed the
Bayesian Information Criterion (BIC) for multiple linear regression models, and 
Spiegelhalter et al. (2002) proposed the Deviance Information Criterion (DIC) for 
more general Bayesian models. The model with the lowest value of the information 
criterion is selected. Model selection in Bayesian analysis is an area of current research. 


11.3.3. Special Cases of Inference in Bayesian Multiple 
Regression Models 


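Two special cases of inference in this Bayesian linear model are of particular interest.
First, consider the use of a diffuse prior. Let $\boldsymbol{\phi} = \mathbf{0}$, let $\mathbf{V}$ be a diagonal matrix with
all diagonal elements equal to a large constant (say, $10^6$), and let $\alpha$ and $\delta$ both be
equal to a small constant (say, $10^{-6}$). In this case, $\mathbf{V}^{-1}$ is close to $\mathbf{O}$, and so $\boldsymbol{\phi}_*$,
the Bayesian point estimate of $\boldsymbol{\beta}$ in (11.16), is approximately equal to

$$(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y},$$

the classical least-squares estimate. Also, since $(\mathbf{I} + \mathbf{X}\mathbf{V}\mathbf{X}')^{-1} = \mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X} + \mathbf{V}^{-1})^{-1}\mathbf{X}'$
(see Problem 11.3a), the covariance matrix $\mathbf{W}_*$ approaches

$$\mathbf{W}_* = \frac{\mathbf{y}'[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']\mathbf{y}}{n}\,(\mathbf{X}'\mathbf{X})^{-1}
= \frac{n-k-1}{n}\,s^2(\mathbf{X}'\mathbf{X})^{-1} \quad [\text{by (7.26)}].$$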


Thus, in the case of diffuse priors, the Bayesian confidence region (11.17)
reduces to a region similar to (8.46), and Bayesian confidence intervals for the
regression coefficients in (11.19) are similar to classical confidence intervals in
(8.47); the only differences are the multiplicative factor $(n - k - 1)/n$ and the use of
the $t$ distribution with $n$ degrees of freedom rather than $n - k - 1$ degrees of
freedom. If a Bayesian multiple linear regression model with independent uniformly
distributed priors for $\boldsymbol{\beta}$ and $\ln(\tau^{-1})$ is considered, Bayesian confidence intervals for
the regression coefficients are exactly equal to classical confidence intervals
(Problem 11.5). One result of this is that simple Bayesian interpretations can be
validly applied to confidence intervals for the classical linear model. In fact, most
inferences for the classical linear model can be stated in terms of properties of
posterior distributions.
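The diffuse-prior limits can be illustrated numerically. The sketch below assumes NumPy and synthetic data (all values hypothetical). It sets $\boldsymbol{\phi} = \mathbf{0}$, a diagonal $\mathbf{V}$ with large elements, and small $\alpha$ and $\delta$, then compares $\boldsymbol{\phi}_*$ with the least-squares estimate and $\mathbf{W}_*$ with $[(n - k - 1)/n]s^2(\mathbf{X}'\mathbf{X})^{-1}$. The quadratic form in (11.12) is evaluated through the identity from Problem 11.3a rather than by inverting the $n \times n$ matrix directly, which is a numerical convenience chosen here.

```python
import numpy as np

# Synthetic data (hypothetical).
rng = np.random.default_rng(3)
n, k = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([3.0, 1.0, -2.0]) + rng.normal(size=n)

# Diffuse prior: phi = 0, V large, alpha and delta small.
phi = np.zeros(k + 1)
V = 1e6 * np.eye(k + 1)
alpha = delta = 1e-6

V_inv = np.linalg.inv(V)
A = V_inv + X.T @ X                     # V_*^{-1}
phi_star = np.linalg.solve(A, V_inv @ phi + X.T @ y)
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# (y - X phi)'(I + X V X')^{-1}(y - X phi), using
# (I + X V X')^{-1} = I - X (X'X + V^{-1})^{-1} X'.
r = y - X @ phi
quad = r @ r - (X.T @ r) @ np.linalg.solve(A, X.T @ r)
W_star = (quad + 2.0 * delta) / (n + 2.0 * alpha) * np.linalg.inv(A)

# Classical quantities for comparison.
s2 = np.sum((y - X @ beta_ols) ** 2) / (n - k - 1)
W_limit = (n - k - 1) / n * s2 * np.linalg.inv(X.T @ X)

print(np.max(np.abs(phi_star - beta_ols)), np.max(np.abs(W_star - W_limit)))
```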

The second special case of inference in this Bayesian linear model is the case in
which $\boldsymbol{\phi} = \mathbf{0}$ and $\mathbf{V}$ is a diagonal matrix with a constant on the diagonal. Thus
$\mathbf{V} = a\mathbf{I}$, where $a$ is a positive number, and the Bayesian estimator of $\boldsymbol{\beta}$ in (11.16)
becomes

$$\left(\mathbf{X}'\mathbf{X} + \frac{1}{a}\mathbf{I}\right)^{-1}\mathbf{X}'\mathbf{y}.$$

For the centered model (Section 7.5) this estimator is also known as the "ridge
estimator" (Hoerl and Kennard 1970). It was originally proposed as a method for
dealing with collinearity, the situation in which the columns of the $\mathbf{X}$ matrix have
near-linear dependence so that $\mathbf{X}'\mathbf{X}$ is nearly singular. However, the estimator
may also be understood as a "shrinkage estimator" in which prior information
causes the estimates of the coefficients to be shrunken toward zero (Seber and
Lee 2003, pp. 321-322). The use of a Bayesian linear model with hyperpriors
(prior distributions for the parameters of the prior distributions) leads to a reasonable
choice of value for $a$ in terms of variances of the prior and hyperprior distributions
(Lindley and Smith 1972).
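The connection with ridge regression, and the shrinkage it produces, can be seen in a small example. The sketch below assumes NumPy, synthetic centered data with two nearly collinear predictors, and an arbitrarily chosen value of $a$ (all hypothetical). It computes the posterior mean from (11.16) with $\boldsymbol{\phi} = \mathbf{0}$ and $\mathbf{V} = a\mathbf{I}$, the ridge formula, and the least-squares estimate.

```python
import numpy as np

# Synthetic data with near-collinear predictors (hypothetical).
rng = np.random.default_rng(4)
n, k = 40, 3
Z = rng.normal(size=(n, k))
Z[:, 2] = Z[:, 1] + 0.05 * rng.normal(size=n)   # column 2 nearly equals column 1
y = Z @ np.array([1.0, 1.0, 1.0]) + rng.normal(size=n)

# Centered model: center the predictors and the response.
Xc = Z - Z.mean(axis=0)
yc = y - y.mean()

a = 0.5                       # prior variance multiplier, V = a I
phi = np.zeros(k)
V = a * np.eye(k)

# Posterior mean (11.16) with phi = 0 and V = a I.
V_inv = np.linalg.inv(V)
phi_star = np.linalg.solve(V_inv + Xc.T @ Xc, V_inv @ phi + Xc.T @ yc)

# Ridge estimator and ordinary least squares for comparison.
ridge = np.linalg.solve(Xc.T @ Xc + np.eye(k) / a, Xc.T @ yc)
ols = np.linalg.solve(Xc.T @ Xc, Xc.T @ yc)

print(phi_star, ols)
print(np.linalg.norm(ridge), np.linalg.norm(ols))   # ridge is shrunken toward zero
```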


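11.3.4 Bayesian Point and Interval Estimation of $\sigma^2$


A possible Bayesian point estimator of $\sigma^2$ is the mean of the marginal inverse gamma
density in (11.15)

$$\frac{(-\boldsymbol{\phi}_*'\mathbf{V}_*^{-1}\boldsymbol{\phi}_* + \boldsymbol{\phi}'\mathbf{V}^{-1}\boldsymbol{\phi} + \mathbf{y}'\mathbf{y} + 2\delta)/2}{\alpha + n/2 - 1},$$

and a $100(1 - \omega)\%$ Bayesian confidence interval for $\sigma^2$ is given by the $1 - \omega/2$ and
$\omega/2$ quantiles of the appropriate inverse gamma distribution.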

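As a special case, note that if $\alpha$ and $\delta$ are both close to 0, $\boldsymbol{\phi} = \mathbf{0}$, and $\mathbf{V}$ is a
diagonal matrix with all diagonal elements equal to a large constant so that $\mathbf{V}^{-1}$ is close
to $\mathbf{O}$, then the Bayesian point estimator of $\sigma^2$ is approximately

$$\begin{aligned}
\frac{(\mathbf{y}'\mathbf{y} - \boldsymbol{\phi}_*'\mathbf{V}_*^{-1}\boldsymbol{\phi}_*)/2}{n/2 - 1}
&= \frac{\mathbf{y}'\mathbf{y} - \mathbf{y}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}}{n - 2} \\
&= \frac{\mathbf{y}'[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']\mathbf{y}}{n - 2} \\
&= \frac{n-k-1}{n-2}\,s^2,
\end{aligned}$$

and the centered Bayesian confidence limits are the $1 - \omega/2$ quantile and the $\omega/2$
quantile of the inverse gamma distribution with parameters $n/2$ and
$\mathbf{y}'[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']\mathbf{y}/2$.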
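A numerical sketch of these estimates follows, assuming NumPy and synthetic data with a diffuse prior (all values hypothetical). The point estimate is the inverse gamma mean. The interval is approximated here by simulation, drawing $\tau$ from its gamma posterior and taking quantiles of $1/\tau$; this is a convenience chosen for the sketch, and exact quantiles could be taken from an inverse gamma quantile function instead.

```python
import numpy as np

# Synthetic data (hypothetical).
rng = np.random.default_rng(5)
n, k = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, 0.5]) + rng.normal(scale=1.5, size=n)

# Diffuse prior.
phi = np.zeros(k + 1)
V = 1e6 * np.eye(k + 1)
alpha = delta = 1e-6

V_inv = np.linalg.inv(V)
A = V_inv + X.T @ X                                  # V_*^{-1}
phi_star = np.linalg.solve(A, V_inv @ phi + X.T @ y)

# Parameters of the posterior of tau: gamma(shape, rate).
shape = alpha + n / 2
rate = (-phi_star @ A @ phi_star + phi @ V_inv @ phi + y @ y + 2 * delta) / 2

# Point estimate of sigma^2: mean of the inverse gamma density (11.15).
sigma2_mean = rate / (shape - 1)

# Central 95% interval for sigma^2 from simulated draws of 1/tau.
tau = rng.gamma(shape=shape, scale=1.0 / rate, size=200_000)
lower, upper = np.quantile(1.0 / tau, [0.025, 0.975])

# Classical s^2 and the diffuse-prior approximation.
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
s2 = np.sum((y - X @ beta_ols) ** 2) / (n - k - 1)
print(sigma2_mean, (n - k - 1) / (n - 2) * s2, (lower, upper))
```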


11.4 BAYESIAN INFERENCE THROUGH MARKOV CHAIN 
MONTE CARLO SIMULATION 


The inability to derive a closed-form marginal posterior distribution for a parameter is 
extremely common in Bayesian inference (Gilks et al. 1998, p. 3). For example, if the 
Bayesian multiple regression model of Section 11.2.1 had involved a prior
distribution for $\boldsymbol{\beta}$ that was not conditional on $\tau$, closed-form marginal distributions for
the parameters could not have been derived (Lindley and Smith 1972). In actual
tice, the exception in Bayesian inference is to be able to derive closed-form marginal 
posterior distributions. However, this difficulty turns out to be only a minor hindrance 
when modern computing resources are available. 

If it were possible, an ideal solution would be to draw a large number of samples 
from the joint posterior distribution. Then marginal means, marginal highest density 
intervals, and other properties of the posterior distribution could be approximated 
using sample statistics. Furthermore, functions of the sampled values could be calcu- 
lated in order to approximate marginal posterior distributions of these functions. The 
big question, of course, is how it would be possible to draw samples from a distri- 
bution for which a familiar closed-form joint density function is not available. 

A general approach for accomplishing this is referred to as Markov Chain Monte 
Carlo (MCMC) simulation (Gilks et al. 1998). A Markov Chain is a special sequence 
of random variables (Ross 2006, p. 185). Probability laws for general sequences of 
random variables are specified in terms of the conditional distribution of the 
current value in the sequence, given all past values. A Markov Chain is a simple 
sequence in which the conditional distribution of the current value is completely 
specified, given only the most recent value. 

Markov Chain Monte Carlo simulation in Bayesian inference is based on sequences of 
alternating random draws from conditional posterior distributions of each of the par- 
ameters in the model given the most recent values of the other parameters. This 
process generates a Markov Chain for each parameter. Moreover, the unconditional
distribution for each parameter converges to the marginal posterior distribution of the
parameter, and the unconditional joint distribution of the vector of parameters for any
complete iteration of MCMC converges to the joint posterior distribution. Thus after
discarding a number of initial draws (the "burn-in"), draws may be considered to constitute
sequences of samples from marginal posterior distributions of the parameters. The 
samples are not independent, but the nonindependence can be ignored if the number 
of draws is sufficiently large. Plots of sample values can be examined to determine 
whether a sufficiently large number of draws has been obtained (Gilks et al. 1998). 

When the prior distributions are conjugate, closed-form density functions of the 
conditional posterior distributions of the parameters are available regardless of 
whether closed-form marginal posterior distributions can be derived. In the case of 
conjugate priors, a simple form of MCMC called “Gibbs sampling” (Gilks et al. 
1998, Casella and George 1992) can be used by which draws are made successively 
from each of the conditional distributions of the parameters, given the current draws 
for the other parameters. 

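We now illustrate this procedure. Consider again the Bayesian multiple regression
model in which $\mathbf{y}|\boldsymbol{\beta}, \tau$ is $N_n(\mathbf{X}\boldsymbol{\beta}, \tau^{-1}\mathbf{I})$, $\boldsymbol{\beta}|\tau$ is $N_{k+1}(\boldsymbol{\phi}, \tau^{-1}\mathbf{V})$, and $\tau$ is gamma($\alpha$, $\delta$).
The joint posterior density function is given in (11.10). The conditional posterior
density (or "full conditional") of $\boldsymbol{\beta}|\tau, \mathbf{y}$ can be obtained by picking the terms out
of (11.10) that involve $\boldsymbol{\beta}$, and considering everything else to be part of the
normalizing constant. Thus, the conditional density of $\boldsymbol{\beta}|\tau, \mathbf{y}$ is

$$\varphi(\boldsymbol{\beta}|\tau, \mathbf{y}) = c_6\,e^{-\tau(\boldsymbol{\beta}-\boldsymbol{\phi}_*)'\mathbf{V}_*^{-1}(\boldsymbol{\beta}-\boldsymbol{\phi}_*)/2}.$$

Clearly $\boldsymbol{\beta}|\tau, \mathbf{y}$ is $N_{k+1}(\boldsymbol{\phi}_*, \tau^{-1}\mathbf{V}_*)$. Similarly, the conditional posterior density for
$\tau|\boldsymbol{\beta}, \mathbf{y}$ is

$$\psi(\tau|\boldsymbol{\beta}, \mathbf{y}) = c_7\,\tau^{(\alpha_{**}+k+3)/2-1}e^{-\tau[(\boldsymbol{\beta}-\boldsymbol{\phi}_*)'\mathbf{V}_*^{-1}(\boldsymbol{\beta}-\boldsymbol{\phi}_*)+\delta_{**}]/2},$$

so that $\tau|\boldsymbol{\beta}, \mathbf{y}$ can be seen to be gamma$\{(\alpha_{**} + k + 3)/2,\ [(\boldsymbol{\beta}-\boldsymbol{\phi}_*)'\mathbf{V}_*^{-1}(\boldsymbol{\beta}-\boldsymbol{\phi}_*) + \delta_{**}]/2\}$.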


Gibbs sampling for this model proceeds as follows: 


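• Specify a starting value $\tau_0$ [possibly $1/s^2$ from (7.23)].

• For $i = 1$ to $M$: draw $\boldsymbol{\beta}_i$ from $N_{k+1}(\boldsymbol{\phi}_*, \tau_{i-1}^{-1}\mathbf{V}_*)$, draw $\tau_i$ from
gamma$\{(\alpha_{**} + k + 3)/2,\ [(\boldsymbol{\beta}_i-\boldsymbol{\phi}_*)'\mathbf{V}_*^{-1}(\boldsymbol{\beta}_i-\boldsymbol{\phi}_*) + \delta_{**}]/2\}$.

• Discard the first $Q$ draws (as burn-in), and consider the last $M - Q$ draws $(\boldsymbol{\beta}_i, \tau_i)$
to be draws from the joint posterior distribution. For this model, using the
starting value of $1/s^2$, $Q$ would usually be very small (say, 0), and $M$ would be large
(say, 10,000).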
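The steps above can be sketched as follows. This is a minimal illustration, not the authors' program: it assumes NumPy, uses the body fat values as listed in Table 11.1 with a diffuse prior of the kind described in Section 11.3.3, and chooses variable names and the random seed arbitrarily. Note that NumPy's gamma generator takes a shape and a scale, so the reciprocal of the rate is passed.

```python
import numpy as np

# Body fat data as listed in Table 11.1: columns y, x1, x2.
data = np.array([
    [11.9, 19.5, 29.1], [22.8, 24.7, 28.2], [18.7, 30.7, 37.0],
    [20.1, 29.8, 31.1], [12.9, 19.1, 30.9], [21.7, 25.6, 23.7],
    [27.1, 31.4, 27.6], [25.4, 27.9, 30.6], [21.3, 22.1, 23.2],
    [19.3, 25.5, 24.8], [25.4, 31.1, 30.0], [27.2, 30.4, 28.3],
    [11.7, 18.7, 23.0], [17.8, 19.7, 28.6], [12.8, 14.6, 21.3],
    [23.9, 29.5, 30.1], [22.6, 27.7, 25.7], [25.4, 30.2, 24.6],
    [14.8, 22.7, 27.1], [21.1, 25.2, 27.5],
])
y = data[:, 0]
X = np.column_stack([np.ones(len(y)), data[:, 1:]])
n, p = X.shape                       # p = k + 1

# Diffuse prior (assumed values, chosen for illustration).
phi = np.zeros(p)
V = 1e6 * np.eye(p)
alpha = delta = 1e-4

# Posterior quantities from (11.10).
V_inv = np.linalg.inv(V)
V_star_inv = V_inv + X.T @ X
V_star = np.linalg.inv(V_star_inv)
phi_star = V_star @ (V_inv @ phi + X.T @ y)
delta_2star = -phi_star @ V_star_inv @ phi_star + phi @ V_inv @ phi + y @ y + 2 * delta
alpha_2star = 2 * alpha - 2 + n

rng = np.random.default_rng(2008)
M, Q = 10_000, 0
resid = y - X @ phi_star
tau = (n - p) / (resid @ resid)      # starting value, approximately 1/s^2
L = np.linalg.cholesky(V_star)       # V_star = L L'

betas = np.empty((M, p))
taus = np.empty(M)
for i in range(M):
    # beta | tau, y  is  N(phi_star, V_star / tau)
    beta = phi_star + (L @ rng.standard_normal(p)) / np.sqrt(tau)
    # tau | beta, y  is  gamma(shape, rate)
    d = beta - phi_star
    rate = (d @ V_star_inv @ d + delta_2star) / 2
    tau = rng.gamma(shape=(alpha_2star + p + 2) / 2, scale=1.0 / rate)
    betas[i], taus[i] = beta, tau

beta_draws = betas[Q:]
sigma2_draws = 1.0 / taus[Q:]
print(beta_draws.mean(axis=0), phi_star)
print(sigma2_draws.mean(), np.quantile(sigma2_draws, [0.025, 0.975]))
```

Because closed-form marginal posteriors are available for this model, the sample means of the draws can be compared with $\boldsymbol{\phi}_*$ and with the inverse gamma mean of Section 11.3.4 as a check on the sampler.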


Bayesian inferences for all parameters of the model could now be carried out using
sample statistics of this empirical joint posterior distribution. For example, a Bayesian
point estimate of $\tau$ could be calculated as the sample mean or median of the draws of $\tau$
from the joint posterior distribution. If we calculate (or "monitor") $1/\tau$ on each
iteration, a Bayesian point estimate of $\sigma^2 = 1/\tau$ could be calculated as the mean or
median of $1/\tau$. A 95% Bayesian interval estimate of $\sigma^2$ could be computed as the
central 95% interval of the sample distribution of $\sigma^2$. Other inferences could similarly
be drawn on the basis of sample draws from the joint posterior distribution.


TABLE 11.1 Body Fat Data

   y      x1      x2
11.9    19.5    29.1
22.8    24.7    28.2
18.7    30.7    37.0
20.1    29.8    31.1
12.9    19.1    30.9
21.7    25.6    23.7
27.1    31.4    27.6
25.4    27.9    30.6
21.3    22.1    23.2
19.3    25.5    24.8
25.4    31.1    30.0
27.2    30.4    28.3
11.7    18.7    23.0
17.8    19.7    28.6
12.8    14.6    21.3
23.9    29.5    30.1
22.6    27.7    25.7
25.4    30.2    24.6
14.8    22.7    27.1
21.1    25.2    27.5


Example 11.4. Table 11.1 contains body fat data for a sample of 20 females aged
25-34 (Kutner et al. 2005, p. 256). The response variable was body fat ($y$), and
two predictor variables were triceps skinfold thickness ($x_1$) and midarm circumference
($x_2$). The data were analyzed using the Bayesian multiple regression model of
Section 11.2.1 with diffuse priors in which $\boldsymbol{\phi}' = (0, 0, 0)$, $\mathbf{V} = 10^6\,\mathbf{I}$, $\alpha = 0.0001$,
and $\delta = 0.0001$. Density functions of the marginal posterior distributions of $\beta_0$,
$\beta_1$, and $\beta_2$ from (11.13) as well as the marginal posterior density of $\sigma^2$ from
(11.15) are graphed in Figure 11.1. Superimposed on these (and almost indistinguishable
from them) are smooth estimates (Silverman 1999) of the same posterior densities
based on Gibbs sampling with $Q = 0$ and $M = 10{,}000$.
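The Gibbs scheme of Section 11.4 can be sketched in code for the data of Table 11.1. This is only an illustrative sketch: the function name `gibbs_normal_gamma` is hypothetical, the prior scale `V = 1e6 * I` is an assumption of the sketch, and the gamma shape and rate for $\tau$ are computed directly from the prior and the data (the full conditional implied by the model) rather than through $\alpha_{**}$ and $\delta_{**}$.

```python
import numpy as np

# Table 11.1 rows: body fat y, triceps skinfold x1, midarm circumference x2
data = np.array([
    [11.9, 19.5, 29.1], [22.8, 24.7, 28.2], [18.7, 30.7, 37.0],
    [20.1, 29.8, 31.1], [12.9, 19.1, 30.9], [21.7, 25.6, 23.7],
    [27.1, 31.4, 27.6], [25.4, 27.9, 30.6], [21.3, 22.1, 23.2],
    [19.3, 25.5, 24.8], [25.4, 31.1, 30.0], [27.2, 30.4, 28.3],
    [11.7, 18.7, 23.0], [17.8, 19.7, 28.6], [12.8, 14.6, 21.3],
    [23.9, 29.5, 30.1], [22.6, 27.7, 25.7], [25.4, 30.2, 24.6],
    [14.8, 22.7, 27.1], [21.1, 25.2, 27.5],
])
y = data[:, 0]
X = np.column_stack([np.ones(len(y)), data[:, 1:]])

def gibbs_normal_gamma(X, y, phi, V, alpha, delta, M=10_000, Q=0, seed=0):
    """Gibbs draws (beta_i, tau_i) for the conjugate normal-gamma regression model."""
    rng = np.random.default_rng(seed)
    n, p = X.shape                                  # p = k + 1
    V_inv = np.linalg.inv(V)
    V_star = np.linalg.inv(V_inv + X.T @ X)
    phi_star = V_star @ (V_inv @ phi + X.T @ y)
    L = np.linalg.cholesky(V_star)                  # V_star = L L'
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    tau = (n - p) / (resid @ resid)                 # starting value tau_0 = 1 / s^2
    shape = alpha + (n + p) / 2.0                   # gamma shape of tau | beta, y
    betas, taus = np.empty((M, p)), np.empty(M)
    for i in range(M):
        # beta_i ~ N(phi_star, V_star / tau_{i-1})
        beta = phi_star + L @ rng.standard_normal(p) / np.sqrt(tau)
        e, d = y - X @ beta, beta - phi
        rate = delta + 0.5 * (e @ e + d @ V_inv @ d)
        tau = rng.gamma(shape, 1.0 / rate)          # NumPy uses scale = 1 / rate
        betas[i], taus[i] = beta, tau
    return betas[Q:], taus[Q:]                      # drop the first Q draws (burn-in)

betas, taus = gibbs_normal_gamma(X, y, phi=np.zeros(3), V=1e6 * np.eye(3),
                                 alpha=1e-4, delta=1e-4)
sigma2 = 1.0 / taus                                 # monitor sigma^2 = 1 / tau
print("posterior means of beta:", betas.mean(axis=0))
print("central 95% interval for sigma^2:", np.percentile(sigma2, [2.5, 97.5]))
```

Sample means, medians, and percentiles of `betas` and `sigma2` then play the role of the Bayesian point and interval estimates described in Section 11.4.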


11.5 POSTERIOR PREDICTIVE INFERENCE 


As a final aspect of Bayesian inference for the multiple regression model, we consider
Bayesian prediction of the value of the response variable for a future individual. If we
again use the Bayesian multiple regression model of Section 11.2.1 in which $\mathbf{y} \mid \boldsymbol{\beta}, \tau$ is
$N_n(\mathbf{X}\boldsymbol{\beta}, \tau^{-1}\mathbf{I})$, $\boldsymbol{\beta} \mid \tau$ is $N_{k+1}(\boldsymbol{\phi}, \tau^{-1}\mathbf{V})$, and $\tau$ is gamma$(\alpha, \delta)$, the posterior
predictive density for a future observation $y_0$ with predictor variables $\mathbf{x}_0$ can be


[Figure 11.1: panels for $\beta_0$, $\beta_1$, $\beta_2$, and $\sigma^2$; legend: posterior density, and approximate posterior density using Gibbs sampling.]

Figure 11.1 Posterior densities of parameters for the fat data in Table 11.1.


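expressed using (11.5) as

$$
\begin{aligned}
r(y_0 \mid \mathbf{y}) &= \int_0^\infty \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} q(y_0 \mid \boldsymbol{\beta}, \tau)\, g(\boldsymbol{\beta}, \tau \mid \mathbf{y})\, d\boldsymbol{\beta}\, d\tau \\
&= c \int_0^\infty \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \tau^{1/2} e^{-\tau(y_0 - \mathbf{x}_0'\boldsymbol{\beta})^2/2}\, \tau^{(\alpha_{**}+k+1)/2} \\
&\qquad \times\, e^{-\tau[(\boldsymbol{\beta}-\boldsymbol{\phi}_*)'\mathbf{V}_*^{-1}(\boldsymbol{\beta}-\boldsymbol{\phi}_*)+\delta_{**}]/2}\, d\boldsymbol{\beta}\, d\tau \\
&= c \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \left[(\boldsymbol{\beta}-\boldsymbol{\phi}_*)'\mathbf{V}_*^{-1}(\boldsymbol{\beta}-\boldsymbol{\phi}_*) + (y_0 - \mathbf{x}_0'\boldsymbol{\beta})^2 + \delta_{**}\right]^{-(\alpha_{**}+k+4)/2} d\boldsymbol{\beta}.
\end{aligned}
$$

Further analytical progress with this integral is difficult. Nonetheless, Gibbs
sampling as in Section 11.4 can be easily extended to simulate the posterior predictive
distribution of $y_0$ as follows: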


• Specify a starting value $\tau_0$ [possibly $1/s^2$ from (7.23)].

• For $i = 1$ to $M$: draw $\boldsymbol{\beta}_i$ from $N_{k+1}(\boldsymbol{\phi}_*, \tau_{i-1}^{-1}\mathbf{V}_*)$, draw $\tau_i$ from
gamma$\{(\alpha_{**}+k+3)/2,\ [(\boldsymbol{\beta}_i-\boldsymbol{\phi}_*)'\mathbf{V}_*^{-1}(\boldsymbol{\beta}_i-\boldsymbol{\phi}_*)+\delta_{**}]/2\}$, draw $y_{0i}$ from
$N(\mathbf{x}_0'\boldsymbol{\beta}_i, \tau_i^{-1})$.

• Discard the first $Q$ draws (as burn-in), and consider the last $M - Q$ draws of $y_{0i}$
to be draws from the posterior predictive distribution.


[Figure 11.2: density curve; horizontal axis $y_0$.]

Figure 11.2 Approximate posterior predictive density using Gibbs sampling for a future
observation $y_0$ with $\mathbf{x}_0' = (1, 20, 25)$ for the fat data in Table 11.1.
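The extra step in the predictive scheme can be sketched as follows. Because $y_{0i}$ is not used when drawing later values of $\boldsymbol{\beta}$ and $\tau$, it can equally be generated after the loop, one value per retained pair $(\boldsymbol{\beta}_i, \tau_i)$. The function name `predictive_draws` is hypothetical, and the stand-in draws below are made-up numbers used only so that the sketch runs by itself; in practice they would come from the Gibbs sampler.

```python
import numpy as np

def predictive_draws(beta_draws, tau_draws, x0, seed=0):
    """Draw y0_i ~ N(x0' beta_i, 1 / tau_i) for each retained Gibbs draw."""
    rng = np.random.default_rng(seed)
    mean = beta_draws @ x0                 # x0' beta_i for every draw
    sd = 1.0 / np.sqrt(tau_draws)          # standard deviation tau_i^(-1/2)
    return rng.normal(mean, sd)

# Stand-in posterior draws (hypothetical values, not output of a real sampler)
rng = np.random.default_rng(1)
beta_draws = np.array([7.0, 1.0, -0.4]) + 0.1 * rng.standard_normal((5000, 3))
tau_draws = rng.gamma(shape=10.0, scale=0.015, size=5000)

x0 = np.array([1.0, 20.0, 25.0])
y0 = predictive_draws(beta_draws, tau_draws, x0)
lower, upper = np.percentile(y0, [2.5, 97.5])   # central 95% prediction interval
print(lower, upper)
```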


Example 11.5 (Example 11.4 continued). Consider a new individual with $x_1 = 20$
and $x_2 = 25$. Thus $\mathbf{x}_0' = (1, 20, 25)$. Figure 11.2 gives a smooth estimate of
the posterior predictive density of $y_0$ based on Gibbs sampling with $Q = 0$ and
$M = 10{,}000$.


The approximate Bayesian 95% prediction interval derived from this density is
(11.83, 20.15), which may be compared to the 95% prediction interval (10.46,
21.57) for the same future individual using the non-Bayesian approach (8.62).

This chapter gives a small taste of the calculations associated with the modern 
Bayesian multiple regression model. With very little additional work, many aspects 
of the model can be modified and customized, especially if the MCMC approach 
is used. Versatility is one of the great advantages of the Bayesian approach. 


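PROBLEMS

11.1 As in Theorem 11.2a, show that
$(\boldsymbol{\beta}-\boldsymbol{\phi})'\mathbf{V}^{-1}(\boldsymbol{\beta}-\boldsymbol{\phi}) + (\mathbf{y}-\mathbf{X}\boldsymbol{\beta})'(\mathbf{y}-\mathbf{X}\boldsymbol{\beta}) + \delta_* = (\boldsymbol{\beta}-\boldsymbol{\phi}_*)'\mathbf{V}_*^{-1}(\boldsymbol{\beta}-\boldsymbol{\phi}_*) + \delta_{**}$,
where $\mathbf{V}_* = (\mathbf{V}^{-1}+\mathbf{X}'\mathbf{X})^{-1}$,

11.2 As used in the proof to Theorem 11.2b, show that

$$\int_0^\infty t^{a} e^{-bt}\, dt = b^{-(a+1)}\,\Gamma(a+1).$$

11.3 (a) Show that $(\mathbf{I}+\mathbf{X}\mathbf{V}\mathbf{X}')^{-1} = \mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X}+\mathbf{V}^{-1})^{-1}\mathbf{X}'$.
(b) Show that $(\mathbf{I}+\mathbf{X}\mathbf{V}\mathbf{X}')^{-1}\mathbf{X} = \mathbf{X}(\mathbf{X}'\mathbf{X}+\mathbf{V}^{-1})^{-1}\mathbf{V}^{-1}$.
(c) Show that $\mathbf{V}^{-1} - \mathbf{V}^{-1}(\mathbf{X}'\mathbf{X}+\mathbf{V}^{-1})^{-1}\mathbf{V}^{-1} = \mathbf{X}'(\mathbf{I}+\mathbf{X}\mathbf{V}\mathbf{X}')^{-1}\mathbf{X}$.

11.4 As in the proof to Theorem 11.2b, show that
$\mathbf{y}'\mathbf{y} + \boldsymbol{\phi}'\mathbf{V}^{-1}\boldsymbol{\phi} - \boldsymbol{\phi}_*'\mathbf{V}_*^{-1}\boldsymbol{\phi}_* = (\mathbf{y}-\mathbf{X}\boldsymbol{\phi})'(\mathbf{I}+\mathbf{X}\mathbf{V}\mathbf{X}')^{-1}(\mathbf{y}-\mathbf{X}\boldsymbol{\phi})$,
where $\mathbf{V}_* = (\mathbf{X}'\mathbf{X}+\mathbf{V}^{-1})^{-1}$ and $\boldsymbol{\phi}_* = \mathbf{V}_*(\mathbf{X}'\mathbf{y}+\mathbf{V}^{-1}\boldsymbol{\phi})$.

11.5 Consider the Bayesian multiple linear regression model in which $\mathbf{y} \mid \boldsymbol{\beta}, \tau$ is
$N_n(\mathbf{X}\boldsymbol{\beta}, \tau^{-1}\mathbf{I})$, $\boldsymbol{\beta}$ is uniform$(R^{k+1})$ [i.e., uniform over $(k+1)$-dimensional
space], and $\ln(\tau^{-1})$ is uniform$(-\infty, \infty)$. Show that the marginal posterior
distribution of $\boldsymbol{\beta} \mid \mathbf{y}$ is the multivariate $t$ distribution with parameters
$[n-k-1,\ \hat{\boldsymbol{\beta}},\ s^2(\mathbf{X}'\mathbf{X})^{-1}]$, where $\hat{\boldsymbol{\beta}}$ and $s^2$ are defined in the usual way
[see (7.6) and (7.23)]. These prior distributions are called improper priors
because uniform distributions must be defined for bounded sets of values.
Nonetheless, the sets can be very large, and so we can proceed as if they
were unbounded.

11.6 Consider the augmented data vector $\begin{pmatrix}\mathbf{y}\\ \boldsymbol{\phi}\end{pmatrix}$ with mean vector $\begin{pmatrix}\mathbf{X}\\ \mathbf{I}_{k+1}\end{pmatrix}\boldsymbol{\beta}$ and
covariance matrix

$$\begin{pmatrix} \dfrac{1}{\tau}\mathbf{I} & \mathbf{O} \\ \mathbf{O} & \dfrac{1}{\tau}\mathbf{V} \end{pmatrix}.$$

Show that the generalized least-squares estimator of $\boldsymbol{\beta}$ is the Bayesian
estimator in (11.16), $(\mathbf{V}^{-1}+\mathbf{X}'\mathbf{X})^{-1}(\mathbf{V}^{-1}\boldsymbol{\phi}+\mathbf{X}'\mathbf{y})$.

11.7 Given that $\tau$ is gamma$(\alpha, \delta)$ as in (11.7), find $E(\tau)$ and var$(\tau)$.

11.8 Use the Bayesian multiple regression model in which $\mathbf{y} \mid \boldsymbol{\beta}, \tau$ is
$N_n(\mathbf{X}\boldsymbol{\beta}, \tau^{-1}\mathbf{I})$, $\boldsymbol{\beta} \mid \tau$ is $N_{k+1}(\boldsymbol{\phi}, \tau^{-1}\mathbf{V})$, and $\tau$ is gamma$(\alpha, \delta)$. Derive the
marginal posterior density function for $\sigma^2 \mid \mathbf{y}$, where $\sigma^2 = 1/\tau$.

11.9 Consider the Bayesian simple linear regression model in which $y_i \mid \beta_0, \beta_1$
is $N(\beta_0+\beta_1 x_i, 1/\tau)$ for $i = 1, \ldots, n$, $\beta_0 \mid \tau$ is $N(a, \sigma_0^2/\tau)$, $\beta_1 \mid \tau$ is
$N(b, \sigma_1^2/\tau)$, cov$(\beta_0, \beta_1 \mid \tau) = \sigma_{12}$, and $\tau$ is gamma$(\alpha, \delta)$.
(a) Find the marginal posterior density of $\beta_1 \mid \mathbf{y}$. (Do not simplify the results.)
(b) Find Bayesian point and interval estimates of $\beta_1$.

11.10 Consider the Bayesian multiple regression model in which $\mathbf{y} \mid \boldsymbol{\beta}, \tau$ is
$N_n(\mathbf{X}\boldsymbol{\beta}, \tau^{-1}\mathbf{I})$, $\boldsymbol{\beta}$ is $N_{k+1}(\boldsymbol{\phi}, \mathbf{V})$, and $\tau$ is gamma$(\alpha, \delta)$. Note that this is
similar to the model of Section 11.2 except that the prior distribution of $\boldsymbol{\beta}$
is not conditional on $\tau$.
(a) Find the joint posterior density of $\boldsymbol{\beta}, \tau \mid \mathbf{y}$ up to a normalizing constant.
(b) Find the conditional posterior density of $\boldsymbol{\beta} \mid \tau, \mathbf{y}$ up to a normalizing
constant.
(c) Find the conditional posterior density of $\tau \mid \boldsymbol{\beta}, \mathbf{y}$ up to a normalizing
constant.
(d) Develop a Gibbs sampling procedure for estimating the marginal posterior
distributions of $\boldsymbol{\beta} \mid \mathbf{y}$ and $(1/\tau) \mid \mathbf{y}$.

11.11 Use the land rent data in Table 7.5.
(a) Find 95% Bayesian confidence intervals for $\beta_1$, $\beta_2$, and $\beta_3$ using (11.19)
in connection with the model in which $\mathbf{y} \mid \boldsymbol{\beta}, \tau$ is $N_n(\mathbf{X}\boldsymbol{\beta}, \tau^{-1}\mathbf{I})$, $\boldsymbol{\beta} \mid \tau$ is
$N_{k+1}(\boldsymbol{\phi}, \tau^{-1}\mathbf{V})$, and $\tau$ is gamma$(\alpha, \delta)$, where $\boldsymbol{\phi} = \mathbf{0}$, $\mathbf{V} = 100\,\mathbf{I}$,
$\alpha = 0.0001$, and $\delta = 0.0001$.
(b) Repeat part (a), but use Gibbs sampling to approximate the confidence
intervals.
(c) Use Gibbs sampling to obtain a 95% Bayesian posterior prediction interval
for a future individual with $\mathbf{x}_0' = (1, 15, 30, 0.5)$.
(d) Repeat part (b), but use the model in which

$$\mathbf{y} \mid \boldsymbol{\beta}, \tau \text{ is } N_n(\mathbf{X}\boldsymbol{\beta}, \tau^{-1}\mathbf{I}), \quad \boldsymbol{\beta} \text{ is } N_{k+1}(\boldsymbol{\phi}, \mathbf{V}), \quad \tau \text{ is gamma}(\alpha, \delta), \tag{11.20}$$

where $\boldsymbol{\phi} = \mathbf{0}$, $\mathbf{V} = 100\,\mathbf{I}$, $\alpha = 0.0001$, and $\delta = 0.0001$.

11.12 As in Section 11.5, show that

$$
\begin{aligned}
&\int_0^\infty \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \tau^{1/2} e^{-\tau(y_0 - \mathbf{x}_0'\boldsymbol{\beta})^2/2}\, \tau^{(\alpha_{**}+k+1)/2}\, e^{-\tau[(\boldsymbol{\beta}-\boldsymbol{\phi}_*)'\mathbf{V}_*^{-1}(\boldsymbol{\beta}-\boldsymbol{\phi}_*)+\delta_{**}]/2}\, d\boldsymbol{\beta}\, d\tau \\
&\qquad = c \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \left[(\boldsymbol{\beta}-\boldsymbol{\phi}_*)'\mathbf{V}_*^{-1}(\boldsymbol{\beta}-\boldsymbol{\phi}_*) + (y_0 - \mathbf{x}_0'\boldsymbol{\beta})^2 + \delta_{**}\right]^{-(\alpha_{**}+k+4)/2} d\boldsymbol{\beta}.
\end{aligned}
$$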


12 Analysis-of-Variance Models


In many experimental situations, a researcher applies several treatments or treatment 
combinations to randomly selected experimental units and then wishes to compare 
the treatment means for some response y. In analysis-of-variance (ANOVA), we 
use linear models to facilitate a comparison of these means. The model is often 
expressed with more parameters than can be estimated, which results in an X 
matrix that is not of full rank. We consider procedures for estimation and testing 
hypotheses for such models. 

The results are illustrated using balanced models, in which we have an equal 
number of observations in each cell or treatment combination. Unbalanced models 
are treated in more detail in Chapter 15. 


12.1 NON-FULL-RANK MODELS 


In Section 12.1.1 we illustrate a simple one-way model, and in Section 12.1.2 we 
illustrate a two-way model without interaction. 


12.1.1 One-Way Model 


Suppose that a researcher has developed two chemical additives for increasing the
mileage of gasoline. To formulate the model, we might start with the notion that
without additives, a gallon yields an average of $\mu$ miles. Then if chemical 1 is
added, the mileage is expected to increase by $\tau_1$ miles per gallon, and if chemical
2 is added, the mileage would increase by $\tau_2$ miles per gallon.

The model could be expressed as

$$y_1 = \mu + \tau_1 + \varepsilon_1, \qquad y_2 = \mu + \tau_2 + \varepsilon_2,$$

where $y_1$ is the miles per gallon from a tank of gasoline containing chemical 1 and $\varepsilon_1$
is a random error term. The variables $y_2$ and $\varepsilon_2$ are defined similarly. The researcher


would like to estimate the parameters $\mu$, $\tau_1$, and $\tau_2$ and test hypotheses such as
$H_0\colon \tau_1 = \tau_2$.

To make reasonable estimates, the researcher needs to observe the mileage per 
gallon for more than one tank of gasoline for each chemical. Suppose that the exper- 
iment consists of filling the tanks of six identical cars with gas, then adding chemical 
1 to three tanks and chemical 2 to the other three tanks. We can write a model for each 
of the six observations as follows: 


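$$
\begin{aligned}
y_{11} &= \mu + \tau_1 + \varepsilon_{11}, & y_{12} &= \mu + \tau_1 + \varepsilon_{12}, & y_{13} &= \mu + \tau_1 + \varepsilon_{13}, \\
y_{21} &= \mu + \tau_2 + \varepsilon_{21}, & y_{22} &= \mu + \tau_2 + \varepsilon_{22}, & y_{23} &= \mu + \tau_2 + \varepsilon_{23},
\end{aligned}
\tag{12.1}
$$

or

$$y_{ij} = \mu + \tau_i + \varepsilon_{ij}, \quad i = 1, 2, \quad j = 1, 2, 3, \tag{12.2}$$

where $y_{ij}$ is the observed miles per gallon of the $j$th car that contains the $i$th chemical
in its tank and $\varepsilon_{ij}$ is the associated random error. The six equations in (12.1) can be
written in matrix form as

$$
\begin{pmatrix} y_{11} \\ y_{12} \\ y_{13} \\ y_{21} \\ y_{22} \\ y_{23} \end{pmatrix}
=
\begin{pmatrix} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \end{pmatrix}
\begin{pmatrix} \mu \\ \tau_1 \\ \tau_2 \end{pmatrix}
+
\begin{pmatrix} \varepsilon_{11} \\ \varepsilon_{12} \\ \varepsilon_{13} \\ \varepsilon_{21} \\ \varepsilon_{22} \\ \varepsilon_{23} \end{pmatrix}
\tag{12.3}
$$

or

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}.$$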


In (12.3), $\mathbf{X}$ is a $6 \times 3$ matrix whose rank is 2 since the first column is the sum of
the second and third columns, which are linearly independent. Since $\mathbf{X}$ is not of full
rank, the theorems of Chapters 7 and 8 cannot be used directly for estimating
$\boldsymbol{\beta} = (\mu, \tau_1, \tau_2)'$ and testing hypotheses. Thus, for example, the parameters $\mu$, $\tau_1$,
and $\tau_2$ cannot be estimated by $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$ in (7.6), because $(\mathbf{X}'\mathbf{X})^{-1}$ does not exist.
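The rank deficiency described above can be checked numerically. The following is a small sketch using NumPy; the variable names are illustrative only.

```python
import numpy as np

# Design matrix of (12.3): columns correspond to mu, tau_1, tau_2
X = np.array([[1, 1, 0]] * 3 + [[1, 0, 1]] * 3, dtype=float)
XtX = X.T @ X

rank_X = np.linalg.matrix_rank(X)
rank_XtX = np.linalg.matrix_rank(XtX)
first_is_sum = np.allclose(X[:, 0], X[:, 1] + X[:, 2])

print(rank_X, rank_XtX)   # 2 2: X'X is 3 x 3 but has rank 2, so it has no inverse
print(first_is_sum)       # True: column 1 equals column 2 plus column 3
```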

To further explore the reasons for the failure of (12.3) to be a full-rank model, let
us reconsider the meaning of the parameters. The parameter $\mu$ was introduced as the
mean before adding chemicals, and $\tau_1$ and $\tau_2$ represented the increase due to
chemicals 1 and 2, respectively. However, the model $y_{ij} = \mu + \tau_i + \varepsilon_{ij}$ in (12.2) cannot
uniquely support this characterization. For example, if $\mu = 15$, $\tau_1 = 1$, and $\tau_2 = 3$,
the model becomes

$$
\begin{aligned}
y_{1j} &= 15 + 1 + \varepsilon_{1j} = 16 + \varepsilon_{1j}, \quad j = 1, 2, 3, \\
y_{2j} &= 15 + 3 + \varepsilon_{2j} = 18 + \varepsilon_{2j}, \quad j = 1, 2, 3.
\end{aligned}
\tag{12.4}
$$


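However, from $y_{1j} = 16 + \varepsilon_{1j}$ and $y_{2j} = 18 + \varepsilon_{2j}$, we cannot determine that
$\mu = 15$, $\tau_1 = 1$, and $\tau_2 = 3$, because the model can also be written as

$$
\begin{aligned}
y_{1j} &= 10 + 6 + \varepsilon_{1j}, \quad j = 1, 2, 3, \\
y_{2j} &= 10 + 8 + \varepsilon_{2j}, \quad j = 1, 2, 3,
\end{aligned}
$$

or alternatively as

$$
\begin{aligned}
y_{1j} &= 25 - 9 + \varepsilon_{1j}, \quad j = 1, 2, 3, \\
y_{2j} &= 25 - 7 + \varepsilon_{2j}, \quad j = 1, 2, 3,
\end{aligned}
$$

or in infinitely many other ways.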
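The "infinitely many other ways" can be written out explicitly. As a brief worked illustration (the constant $c$ is introduced here only for this purpose), every choice

$$(\mu, \tau_1, \tau_2) = (15 + c,\ 1 - c,\ 3 - c), \qquad -\infty < c < \infty,$$

gives $\mu + \tau_1 = 16$ and $\mu + \tau_2 = 18$, and hence the same model for the data. Taking $c = -5$ yields $(10, 6, 8)$ and $c = 10$ yields $(25, -9, -7)$, the two alternatives shown above.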

Thus in (12.1) or (12.2), $\mu$, $\tau_1$, and $\tau_2$ are not unique and therefore cannot be
estimated. With three parameters and rank$(\mathbf{X}) = 2$, the model is said to be
overparameterized. Note that increasing the number of observations (replications) for each
of the two additives will not change the rank of $\mathbf{X}$.

There are various ways—each with its own advantages and disadvantages—to 
remedy this lack of uniqueness of the parameters in the overparameterized model. 
Three such approaches are (1) redefine the model using a smaller number of new par- 
ameters that are unique, (2) use the overparameterized model but place constraints on 
the parameters so that they become unique, and (3) in the overparameterized model, 
work with linear combinations of the parameters that are unique and can be unam- 
biguously estimated. We briefly illustrate these three techniques. 


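1. To reduce the number of parameters, consider the illustration in (12.4):

$$y_{1j} = 16 + \varepsilon_{1j} \quad \text{and} \quad y_{2j} = 18 + \varepsilon_{2j}.$$

The values 16 and 18 are the means after the two treatments have been applied.
In general, these means could be labeled $\mu_1$ and $\mu_2$, and the model could be
written as

$$y_{1j} = \mu_1 + \varepsilon_{1j} \quad \text{and} \quad y_{2j} = \mu_2 + \varepsilon_{2j}.$$

The means $\mu_1$ and $\mu_2$ are unique and can be estimated. The redefined model for
all six observations in (12.1) or (12.2) takes the form

$$
\begin{pmatrix} y_{11} \\ y_{12} \\ y_{13} \\ y_{21} \\ y_{22} \\ y_{23} \end{pmatrix}
=
\begin{pmatrix} 1 & 0 \\ 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \\ 0 & 1 \end{pmatrix}
\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}
+
\begin{pmatrix} \varepsilon_{11} \\ \varepsilon_{12} \\ \varepsilon_{13} \\ \varepsilon_{21} \\ \varepsilon_{22} \\ \varepsilon_{23} \end{pmatrix},
$$

which we write as

$$\mathbf{y} = \mathbf{W}\boldsymbol{\mu} + \boldsymbol{\varepsilon}.$$

The matrix $\mathbf{W}$ is full-rank, and we can use (7.6) to estimate $\boldsymbol{\mu}$ as

$$\hat{\boldsymbol{\mu}} = \begin{pmatrix} \hat{\mu}_1 \\ \hat{\mu}_2 \end{pmatrix} = (\mathbf{W}'\mathbf{W})^{-1}\mathbf{W}'\mathbf{y}.$$

This solution is called reparameterization.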


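2. An alternative to reducing the number of parameters is to incorporate
constraints on the parameters $\mu$, $\tau_1$, and $\tau_2$. We denote the constrained parameters
as $\mu^*$, $\tau_1^*$, and $\tau_2^*$. In (12.1) or (12.2), the constraint $\tau_1^* + \tau_2^* = 0$ has the specific
effect of defining $\mu^*$ to be the new mean after the treatments are applied and $\tau_1^*$
and $\tau_2^*$ to be deviations from this mean. With this constraint, $y_{1j} = 16 + \varepsilon_{1j}$ and
$y_{2j} = 18 + \varepsilon_{2j}$ in (12.4) can be written only as

$$y_{1j} = 17 - 1 + \varepsilon_{1j}, \qquad y_{2j} = 17 + 1 + \varepsilon_{2j}.$$

This model is now unique because there is no other way to express it so that
$\tau_1^* + \tau_2^* = 0$. Such constraints are often called side conditions. The model
$y_{ij} = \mu^* + \tau_i^* + \varepsilon_{ij}$ subject to $\tau_1^* + \tau_2^* = 0$ can be expressed in a full-rank
format by using $\tau_2^* = -\tau_1^*$ to obtain $y_{1j} = \mu^* + \tau_1^* + \varepsilon_{1j}$ and
$y_{2j} = \mu^* - \tau_1^* + \varepsilon_{2j}$. The six observations can then be written in matrix form as

$$
\begin{pmatrix} y_{11} \\ y_{12} \\ y_{13} \\ y_{21} \\ y_{22} \\ y_{23} \end{pmatrix}
=
\begin{pmatrix} 1 & 1 \\ 1 & 1 \\ 1 & 1 \\ 1 & -1 \\ 1 & -1 \\ 1 & -1 \end{pmatrix}
\begin{pmatrix} \mu^* \\ \tau_1^* \end{pmatrix}
+
\begin{pmatrix} \varepsilon_{11} \\ \varepsilon_{12} \\ \varepsilon_{13} \\ \varepsilon_{21} \\ \varepsilon_{22} \\ \varepsilon_{23} \end{pmatrix}
$$

or

$$\mathbf{y} = \mathbf{X}^*\boldsymbol{\mu}^* + \boldsymbol{\varepsilon}.$$

The matrix $\mathbf{X}^*$ is full-rank, and the parameters $\mu^*$ and $\tau_1^*$ can be estimated.
It must be kept in mind, however, that specific constraints impose specific
definitions on the parameters.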


3. As we examine the parameters in the model illustrated in (12.4), we see some
linear combinations that are unique. For example, $\tau_1 - \tau_2 = -2$, $\mu + \tau_1 = 16$,
and $\mu + \tau_2 = 18$ remain the same for all alternative values of $\mu$, $\tau_1$, and $\tau_2$.
Such unique linear combinations can be estimated.
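The three approaches can be compared on a small numerical example. The sketch below uses six made-up mileage values (they are not data from the text), and all variable names are illustrative.

```python
import numpy as np

# Hypothetical mileages: three tanks with chemical 1, then three with chemical 2
y = np.array([16.2, 15.7, 16.4, 18.1, 17.6, 18.3])

# (1) Reparameterization: y = W mu + e with mu = (mu_1, mu_2)'
W = np.repeat(np.eye(2), 3, axis=0)
mu_hat = np.linalg.solve(W.T @ W, W.T @ y)          # the two treatment means

# (2) Side condition tau_1* + tau_2* = 0: y = X* (mu*, tau_1*)' + e
X_star = np.column_stack([np.ones(6), np.repeat([1.0, -1.0], 3)])
mu_star, tau1_star = np.linalg.solve(X_star.T @ X_star, X_star.T @ y)

# (3) The unique linear combination tau_1 - tau_2 equals mu_1 - mu_2
diff = mu_hat[0] - mu_hat[1]

print(mu_hat)               # estimates of mu_1 and mu_2
print(mu_star, tau1_star)   # overall mean and deviation of treatment 1 from it
print(diff)                 # estimate of tau_1 - tau_2
```

In this sketch the side-condition estimates are simple functions of the treatment means: `mu_star` is their average and `tau1_star` is half of `diff`.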




In the following example, we illustrate these three approaches to parameter defi- 
nition in a simple two-way model without interaction. 


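12.1.2 Two-Way Model


Suppose that a researcher wants to measure the effect of two different vitamins and
two different methods of administering the vitamins on the weight gain of chicks.
This leads to a two-way model. Let $\alpha_1$ and $\alpha_2$ be the effects of the two vitamins,
and let $\beta_1$ and $\beta_2$ be the effects of the two methods of administration. If the researcher
assumes that these effects are additive (no interaction; see the last paragraph in this
example for some comments on interaction), the model can be written as

$$
\begin{aligned}
y_{11} &= \mu + \alpha_1 + \beta_1 + \varepsilon_{11}, & y_{12} &= \mu + \alpha_1 + \beta_2 + \varepsilon_{12}, \\
y_{21} &= \mu + \alpha_2 + \beta_1 + \varepsilon_{21}, & y_{22} &= \mu + \alpha_2 + \beta_2 + \varepsilon_{22},
\end{aligned}
$$

or as

$$y_{ij} = \mu + \alpha_i + \beta_j + \varepsilon_{ij}, \quad i = 1, 2, \quad j = 1, 2, \tag{12.5}$$

where $y_{ij}$ is the weight gain of the $ij$th chick and $\varepsilon_{ij}$ is a random error. (To simplify
exposition, we show only one replication for each vitamin-method combination.)
In matrix form, the model can be expressed as

$$
\begin{pmatrix} y_{11} \\ y_{12} \\ y_{21} \\ y_{22} \end{pmatrix}
=
\begin{pmatrix} 1 & 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 0 & 1 \\ 1 & 0 & 1 & 1 & 0 \\ 1 & 0 & 1 & 0 & 1 \end{pmatrix}
\begin{pmatrix} \mu \\ \alpha_1 \\ \alpha_2 \\ \beta_1 \\ \beta_2 \end{pmatrix}
+
\begin{pmatrix} \varepsilon_{11} \\ \varepsilon_{12} \\ \varepsilon_{21} \\ \varepsilon_{22} \end{pmatrix}
\tag{12.6}
$$

or

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}.$$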


In the X matrix, the third column is equal to the first column minus the second 
column, and the fifth column is equal to the first column minus the fourth column. 
Thus rank(X) = 3, and the 5 x 5 matrix X’X does not have an inverse. Many of 
the theorems of Chapters 7 and 8 are therefore not applicable. Note that if there 
were replications leading to additional rows in the X matrix, the rank of X would 
still be 3. 

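Since rank$(\mathbf{X}) = 3$, there are only three possible unique parameters unless side
conditions are imposed on the five parameters. There are many ways to reparameterize
in order to reduce to three parameters in the model. For example, consider the
parameters $\gamma_1$, $\gamma_2$, and $\gamma_3$ defined as

$$\gamma_1 = \mu + \alpha_1 + \beta_1, \qquad \gamma_2 = \alpha_2 - \alpha_1, \qquad \gamma_3 = \beta_2 - \beta_1.$$

The model can be written in terms of the $\gamma$ terms as

$$
\begin{aligned}
y_{11} &= (\mu + \alpha_1 + \beta_1) + \varepsilon_{11} = \gamma_1 + \varepsilon_{11}, \\
y_{12} &= (\mu + \alpha_1 + \beta_1) + (\beta_2 - \beta_1) + \varepsilon_{12} = \gamma_1 + \gamma_3 + \varepsilon_{12}, \\
y_{21} &= (\mu + \alpha_1 + \beta_1) + (\alpha_2 - \alpha_1) + \varepsilon_{21} = \gamma_1 + \gamma_2 + \varepsilon_{21}, \\
y_{22} &= (\mu + \alpha_1 + \beta_1) + (\alpha_2 - \alpha_1) + (\beta_2 - \beta_1) + \varepsilon_{22} = \gamma_1 + \gamma_2 + \gamma_3 + \varepsilon_{22}.
\end{aligned}
$$

In matrix form, this becomes

$$
\begin{pmatrix} y_{11} \\ y_{12} \\ y_{21} \\ y_{22} \end{pmatrix}
=
\begin{pmatrix} 1 & 0 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \end{pmatrix}
\begin{pmatrix} \gamma_1 \\ \gamma_2 \\ \gamma_3 \end{pmatrix}
+
\begin{pmatrix} \varepsilon_{11} \\ \varepsilon_{12} \\ \varepsilon_{21} \\ \varepsilon_{22} \end{pmatrix}
$$

or

$$\mathbf{y} = \mathbf{Z}\boldsymbol{\gamma} + \boldsymbol{\varepsilon}. \tag{12.7}$$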


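The rank of $\mathbf{Z}$ is clearly 3, and we have a full-rank model for which $\boldsymbol{\gamma}$ can be
estimated by $\hat{\boldsymbol{\gamma}} = (\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{y}$. This provides estimates of $\gamma_2 = \alpha_2 - \alpha_1$ and
$\gamma_3 = \beta_2 - \beta_1$, which are typically of interest to the researcher.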
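The relation between the two-way design matrix and its full-rank reparameterization can be checked directly. In the sketch below, the matrix `T` (a name introduced here for illustration) holds the coefficients that express $\gamma_1$, $\gamma_2$, $\gamma_3$ as linear functions of $(\mu, \alpha_1, \alpha_2, \beta_1, \beta_2)$, so that the mean vector is the same in both forms of the model.

```python
import numpy as np

# Two-way design matrix of (12.6); columns: mu, alpha_1, alpha_2, beta_1, beta_2
X = np.array([[1, 1, 0, 1, 0],
              [1, 1, 0, 0, 1],
              [1, 0, 1, 1, 0],
              [1, 0, 1, 0, 1]], dtype=float)

# Full-rank matrix Z of (12.7); columns: gamma_1, gamma_2, gamma_3
Z = np.array([[1, 0, 0],
              [1, 0, 1],
              [1, 1, 0],
              [1, 1, 1]], dtype=float)

# Rows: gamma_1 = mu + a1 + b1, gamma_2 = a2 - a1, gamma_3 = b2 - b1
T = np.array([[1, 1, 0, 1, 0],
              [0, -1, 1, 0, 0],
              [0, 0, 0, -1, 1]], dtype=float)

rank_X = np.linalg.matrix_rank(X)
rank_Z = np.linalg.matrix_rank(Z)
same_mean = np.allclose(Z @ T, X)   # X beta = Z (T beta) for every beta

print(rank_X, rank_Z)   # 3 3
print(same_mean)        # True
```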

In Section 12.2.2, we will discuss methods for showing that linear functions such
as $\mu + \alpha_1 + \beta_1$, $\alpha_2 - \alpha_1$, and $\beta_2 - \beta_1$ are unique and estimable, even though
$\mu$, $\alpha_1$, $\alpha_2$, $\beta_1$, $\beta_2$ are not unique and not estimable.

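We now consider side conditions on the parameters. Since rank$(\mathbf{X}) = 3$ and there
are five parameters, we need two (linearly independent) side conditions. If these
two constraints are appropriately chosen, the five parameters become unique and
thereby estimable. We denote the constrained parameters by $\mu^*$, $\alpha_i^*$, and $\beta_j^*$ and
consider the side conditions $\alpha_1^* + \alpha_2^* = 0$ and $\beta_1^* + \beta_2^* = 0$. These lead to unique
definitions of $\alpha_i^*$ and $\beta_j^*$ as deviations from means. To show this, we start by
writing the model as

$$
\begin{aligned}
y_{11} &= \mu_{11} + \varepsilon_{11}, & y_{12} &= \mu_{12} + \varepsilon_{12}, \\
y_{21} &= \mu_{21} + \varepsilon_{21}, & y_{22} &= \mu_{22} + \varepsilon_{22},
\end{aligned}
\tag{12.8}
$$

where $\mu_{ij} = E(y_{ij})$ is the mean weight gain with vitamin $i$ and method $j$. The means
are displayed in Table 12.1, and the parameters $\alpha_1^*$, $\alpha_2^*$, $\beta_1^*$, $\beta_2^*$ are defined as row ($\alpha$)
and column ($\beta$) effects.

The means in Table 12.1 are defined as follows:

$$\bar{\mu}_{i.} = \frac{\mu_{i1} + \mu_{i2}}{2}, \qquad \bar{\mu}_{.j} = \frac{\mu_{1j} + \mu_{2j}}{2}, \qquad \bar{\mu}_{..} = \frac{\mu_{11} + \mu_{12} + \mu_{21} + \mu_{22}}{4}.$$


TABLE 12.1 Means and Effects for the Model in (12.8)

                         Columns ($\beta$)
Rows ($\alpha$)          1                  2                  Row Means           Row Effects

Row 1                    $\mu_{11}$         $\mu_{12}$         $\bar{\mu}_{1.}$    $\alpha_1^* = \bar{\mu}_{1.} - \bar{\mu}_{..}$
Row 2                    $\mu_{21}$         $\mu_{22}$         $\bar{\mu}_{2.}$    $\alpha_2^* = \bar{\mu}_{2.} - \bar{\mu}_{..}$
Column means             $\bar{\mu}_{.1}$   $\bar{\mu}_{.2}$   $\bar{\mu}_{..}$
Column effects           $\beta_1^* = \bar{\mu}_{.1} - \bar{\mu}_{..}$   $\beta_2^* = \bar{\mu}_{.2} - \bar{\mu}_{..}$


The first row effect, $\alpha_1^* = \bar{\mu}_{1.} - \bar{\mu}_{..}$, is the deviation of the mean for vitamin 1 from
the overall mean (after treatments) and is unique. The parameters $\alpha_2^*$, $\beta_1^*$, and $\beta_2^*$ are
likewise uniquely defined. From the definitions in Table 12.1, we obtain

$$
\begin{aligned}
\alpha_1^* + \alpha_2^* &= \bar{\mu}_{1.} - \bar{\mu}_{..} + \bar{\mu}_{2.} - \bar{\mu}_{..} = \bar{\mu}_{1.} + \bar{\mu}_{2.} - 2\bar{\mu}_{..} \\
&= 2\bar{\mu}_{..} - 2\bar{\mu}_{..} = 0,
\end{aligned}
\tag{12.9}
$$

and similarly, $\beta_1^* + \beta_2^* = 0$. Thus with the side conditions $\alpha_1^* + \alpha_2^* = 0$ and
$\beta_1^* + \beta_2^* = 0$, the redefined parameters are both unique and interpretable.

In (12.5), it is assumed that the effects of vitamin and method are additive. To
make this notion more precise, we write the model (12.5) in terms of
$\mu^* = \bar{\mu}_{..}$, $\alpha_i^* = \bar{\mu}_{i.} - \bar{\mu}_{..}$, and $\beta_j^* = \bar{\mu}_{.j} - \bar{\mu}_{..}$:

$$
\begin{aligned}
\mu_{ij} &= \bar{\mu}_{..} + (\bar{\mu}_{i.} - \bar{\mu}_{..}) + (\bar{\mu}_{.j} - \bar{\mu}_{..}) + (\mu_{ij} - \bar{\mu}_{i.} - \bar{\mu}_{.j} + \bar{\mu}_{..}) \\
&= \mu^* + \alpha_i^* + \beta_j^* + (\mu_{ij} - \bar{\mu}_{i.} - \bar{\mu}_{.j} + \bar{\mu}_{..}).
\end{aligned}
$$

The term $\mu_{ij} - \bar{\mu}_{i.} - \bar{\mu}_{.j} + \bar{\mu}_{..}$, which is required to balance the equation, is
associated with the interaction between vitamins and methods. In order for $\alpha_i^*$ and $\beta_j^*$ to
be additive effects, the interaction $\mu_{ij} - \bar{\mu}_{i.} - \bar{\mu}_{.j} + \bar{\mu}_{..}$ must be zero. Interaction
will be treated in Chapter 14.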
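The row effects, column effects, and interaction terms can be computed from a table of cell means. The sketch below uses a made-up $2 \times 2$ table of additive means (hypothetical values chosen only for illustration).

```python
import numpy as np

# Hypothetical cell means mu_ij (rows: vitamins, columns: methods), chosen additive
mu = np.array([[10.0, 14.0],
               [13.0, 17.0]])

grand = mu.mean()                               # overall mean
row_eff = mu.mean(axis=1) - grand               # alpha_i* = row mean - overall mean
col_eff = mu.mean(axis=0) - grand               # beta_j*  = column mean - overall mean
interaction = (mu - mu.mean(axis=1, keepdims=True)
                  - mu.mean(axis=0, keepdims=True) + grand)

print(row_eff, col_eff)   # each pair of effects sums to zero
print(interaction)        # all zeros, because these means are additive
```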


12.2 ESTIMATION


In this section, we consider various aspects of estimation of $\boldsymbol{\beta}$ in the non-full-rank
model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$. We do not reparameterize or impose side conditions. These
two approaches to estimation are discussed in Sections 12.5 and 12.6, respectively.
Normality of $\mathbf{y}$ is not assumed in the present section.


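12.2.1 Estimation of $\boldsymbol{\beta}$


Consider the model

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon},$$

where $E(\mathbf{y}) = \mathbf{X}\boldsymbol{\beta}$, cov$(\mathbf{y}) = \sigma^2\mathbf{I}$, and $\mathbf{X}$ is $n \times p$ of rank $k < p < n$. [We will say
"$\mathbf{X}$ is $n \times p$ of rank $k < p < n$" to indicate that $\mathbf{X}$ is not of full rank; that is,
rank$(\mathbf{X}) < p$ and rank$(\mathbf{X}) < n$. In some cases, we have $k < n < p$.] In this
non-full-rank model, the $p$ parameters in $\boldsymbol{\beta}$ are not unique. We now ascertain whether
$\boldsymbol{\beta}$ can be estimated.

Using least-squares, we seek a value of $\hat{\boldsymbol{\beta}}$ that minimizes

$$\hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}} = (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}).$$

We can expand $\hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}}$ to obtain

$$\hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}} = \mathbf{y}'\mathbf{y} - 2\hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} + \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}}, \tag{12.10}$$

which can be differentiated with respect to $\hat{\boldsymbol{\beta}}$ and set equal to $\mathbf{0}$ to produce the familiar
normal equations

$$\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}'\mathbf{y}. \tag{12.11}$$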


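Since $\mathbf{X}$ is not full rank, $\mathbf{X}'\mathbf{X}$ has no inverse, and (12.11) does not have a unique
solution. However, $\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}'\mathbf{y}$ has (an infinite number of) solutions:


Theorem 12.2a. If $\mathbf{X}$ is $n \times p$ of rank $k < p < n$, the system of equations
$\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}'\mathbf{y}$ is consistent.


PROOF. By Theorem 2.8f, the system is consistent if and only if

$$\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{y} = \mathbf{X}'\mathbf{y}, \tag{12.12}$$

where $(\mathbf{X}'\mathbf{X})^{-}$ is any generalized inverse of $\mathbf{X}'\mathbf{X}$. By Theorem 2.8c(iii),
$\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}' = \mathbf{X}'$, and (12.12) therefore holds. (An alternative proof is suggested in
Problem 12.3.)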


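Since the normal equations $\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}'\mathbf{y}$ are consistent, a solution is given by
Theorem 2.8d as

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{y}, \tag{12.13}$$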
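The non-uniqueness of the solution can be seen numerically. The sketch below uses the one-way design of (12.3) with made-up response values, and compares two generalized inverses of $\mathbf{X}'\mathbf{X}$: a diagonal one, and the Moore-Penrose inverse returned by `numpy.linalg.pinv`. Both choices are illustrative assumptions of this sketch.

```python
import numpy as np

X = np.array([[1, 1, 0]] * 3 + [[1, 0, 1]] * 3, dtype=float)
y = np.array([16.2, 15.7, 16.4, 18.1, 17.6, 18.3])   # hypothetical responses
XtX, Xty = X.T @ X, X.T @ y

G1 = np.diag([0.0, 1 / 3, 1 / 3])     # one generalized inverse of X'X
G2 = np.linalg.pinv(XtX)              # another: the Moore-Penrose inverse

# Both satisfy the defining property X'X G X'X = X'X
ok1 = np.allclose(XtX @ G1 @ XtX, XtX)
ok2 = np.allclose(XtX @ G2 @ XtX, XtX)

b1, b2 = G1 @ Xty, G2 @ Xty           # two solutions of the normal equations
print(b1, b2)                         # different parameter vectors ...
print(X @ b1, X @ b2)                 # ... but identical fitted values
```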


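where $(\mathbf{X}'\mathbf{X})^{-}$ is any generalized inverse of $\mathbf{X}'\mathbf{X}$. For a particular generalized
inverse $(\mathbf{X}'\mathbf{X})^{-}$, the expected value of $\hat{\boldsymbol{\beta}}$ is

$$E(\hat{\boldsymbol{\beta}}) = (\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'E(\mathbf{y}) = (\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{X}\boldsymbol{\beta}. \tag{12.14}$$

Thus, $\hat{\boldsymbol{\beta}}$ is an unbiased estimator of $(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{X}\boldsymbol{\beta}$. Since $(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{X} \neq \mathbf{I}$, $\hat{\boldsymbol{\beta}}$ is not an
unbiased estimator of $\boldsymbol{\beta}$. The expression $(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{X}$ is not invariant to the choice of
$(\mathbf{X}'\mathbf{X})^{-}$; that is, $E(\hat{\boldsymbol{\beta}})$ is different for each choice of $(\mathbf{X}'\mathbf{X})^{-}$. [An implication in
(12.14) is that having selected a value of $(\mathbf{X}'\mathbf{X})^{-}$, we would use that same value of
$(\mathbf{X}'\mathbf{X})^{-}$ in repeated sampling.]

Thus, $\hat{\boldsymbol{\beta}}$ in (12.13) does not estimate $\boldsymbol{\beta}$. Next, we inquire as to whether there are
any linear functions of $\mathbf{y}$ that are unbiased estimators for the elements of $\boldsymbol{\beta}$; that is,
whether there exists a $p \times n$ matrix $\mathbf{A}$ such that $E(\mathbf{A}\mathbf{y}) = \boldsymbol{\beta}$. If so, then

$$\boldsymbol{\beta} = E(\mathbf{A}\mathbf{y}) = E[\mathbf{A}(\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon})] = E(\mathbf{A}\mathbf{X}\boldsymbol{\beta}) + \mathbf{A}E(\boldsymbol{\varepsilon}) = \mathbf{A}\mathbf{X}\boldsymbol{\beta}.$$

Since this must hold for all $\boldsymbol{\beta}$, we have $\mathbf{A}\mathbf{X} = \mathbf{I}_p$ [see (2.44)]. But by Theorem 2.4(i),
rank$(\mathbf{A}\mathbf{X}) < p$ since the rank of $\mathbf{X}$ is less than $p$. Hence $\mathbf{A}\mathbf{X}$ cannot be equal to $\mathbf{I}_p$, and
there are no linear functions of the observations that yield unbiased estimators for the
elements of $\boldsymbol{\beta}$.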


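Example 12.2.1. Consider the model $y_{ij} = \mu + \tau_i + \varepsilon_{ij}$; $i = 1, 2$; $j = 1, 2, 3$ in
(12.2). The matrix $\mathbf{X}$ and the vector $\boldsymbol{\beta}$ are given in (12.3) as

$$
\mathbf{X} = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \end{pmatrix},
\qquad
\boldsymbol{\beta} = \begin{pmatrix} \mu \\ \tau_1 \\ \tau_2 \end{pmatrix}.
$$

By Theorem 2.2c(i), we obtain

$$\mathbf{X}'\mathbf{X} = \begin{pmatrix} 6 & 3 & 3 \\ 3 & 3 & 0 \\ 3 & 0 & 3 \end{pmatrix}.$$

By Corollary 1 to Theorem 2.8b, a generalized inverse of $\mathbf{X}'\mathbf{X}$ is given by

$$(\mathbf{X}'\mathbf{X})^{-} = \begin{pmatrix} 0 & 0 & 0 \\ 0 & \tfrac{1}{3} & 0 \\ 0 & 0 & \tfrac{1}{3} \end{pmatrix}.$$

The vector $\mathbf{X}'\mathbf{y}$ is given by

$$
\mathbf{X}'\mathbf{y} = \begin{pmatrix} 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 1 & 1 \end{pmatrix}
\begin{pmatrix} y_{11} \\ y_{12} \\ y_{13} \\ y_{21} \\ y_{22} \\ y_{23} \end{pmatrix}
= \begin{pmatrix} y_{..} \\ y_{1.} \\ y_{2.} \end{pmatrix},
$$

where $y_{..} = \sum_{i=1}^{2}\sum_{j=1}^{3} y_{ij}$ and $y_{i.} = \sum_{j=1}^{3} y_{ij}$. Then

$$
\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{y}
= \begin{pmatrix} 0 & 0 & 0 \\ 0 & \tfrac{1}{3} & 0 \\ 0 & 0 & \tfrac{1}{3} \end{pmatrix}
\begin{pmatrix} y_{..} \\ y_{1.} \\ y_{2.} \end{pmatrix}
= \begin{pmatrix} 0 \\ \bar{y}_{1.} \\ \bar{y}_{2.} \end{pmatrix},
$$

where $\bar{y}_{i.} = \sum_{j=1}^{3} y_{ij}/3 = y_{i.}/3$.

To find $E(\hat{\boldsymbol{\beta}})$, we need $E(\bar{y}_{i.})$. Since $E(\boldsymbol{\varepsilon}) = \mathbf{0}$, we have $E(\varepsilon_{ij}) = 0$. Then

$$
\begin{aligned}
E(\bar{y}_{i.}) &= E\Bigl(\sum_{j=1}^{3} y_{ij}/3\Bigr) = \tfrac{1}{3}\sum_{j=1}^{3} E(y_{ij}) \\
&= \tfrac{1}{3}\sum_{j=1}^{3} E(\mu + \tau_i + \varepsilon_{ij}) = \tfrac{1}{3}(3\mu + 3\tau_i + 0) \\
&= \mu + \tau_i.
\end{aligned}
$$

Thus

$$E(\hat{\boldsymbol{\beta}}) = \begin{pmatrix} 0 \\ \mu + \tau_1 \\ \mu + \tau_2 \end{pmatrix}.$$

The same result is obtained in (12.14):

$$
E(\hat{\boldsymbol{\beta}}) = (\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{X}\boldsymbol{\beta}
= \begin{pmatrix} 0 & 0 & 0 \\ 0 & \tfrac{1}{3} & 0 \\ 0 & 0 & \tfrac{1}{3} \end{pmatrix}
\begin{pmatrix} 6 & 3 & 3 \\ 3 & 3 & 0 \\ 3 & 0 & 3 \end{pmatrix}
\begin{pmatrix} \mu \\ \tau_1 \\ \tau_2 \end{pmatrix}
= \begin{pmatrix} 0 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \end{pmatrix}
\begin{pmatrix} \mu \\ \tau_1 \\ \tau_2 \end{pmatrix}
= \begin{pmatrix} 0 \\ \mu + \tau_1 \\ \mu + \tau_2 \end{pmatrix}.
$$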
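The product $(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{X}$ in this example is easy to verify numerically. The sketch below also applies it to the illustrative parameter values $\mu = 15$, $\tau_1 = 1$, $\tau_2 = 3$ from (12.4); the variable names are illustrative.

```python
import numpy as np

XtX = np.array([[6, 3, 3], [3, 3, 0], [3, 0, 3]], dtype=float)
G = np.diag([0.0, 1 / 3, 1 / 3])       # the generalized inverse used in the example

H = G @ XtX                            # E(beta-hat) = H beta
beta = np.array([15.0, 1.0, 3.0])      # mu = 15, tau_1 = 1, tau_2 = 3 as in (12.4)
expected_beta_hat = H @ beta

print(H)                   # rows (0, 0, 0), (1, 1, 0), (1, 0, 1)
print(expected_beta_hat)   # (0, mu + tau_1, mu + tau_2) = (0, 16, 18)
```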




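12.2.2 Estimable Functions of $\boldsymbol{\beta}$


Having established that we cannot estimate $\boldsymbol{\beta}$, we next inquire as to whether we can
estimate any linear combination of the $\beta$'s, say, $\boldsymbol{\lambda}'\boldsymbol{\beta}$. For example, in Section 12.1.1,
we considered the model $y_{ij} = \mu + \tau_i + \varepsilon_{ij}$, $i = 1, 2$, and found that $\mu$, $\tau_1$, and $\tau_2$ in
$\boldsymbol{\beta} = (\mu, \tau_1, \tau_2)'$ are not unique but that the linear function $\tau_1 - \tau_2 = (0, 1, -1)\boldsymbol{\beta}$ is
unique. In order to show that functions such as $\tau_1 - \tau_2$ can be estimated, we first give
a definition of an estimable function $\boldsymbol{\lambda}'\boldsymbol{\beta}$.

A linear function of parameters $\boldsymbol{\lambda}'\boldsymbol{\beta}$ is said to be estimable if there exists a linear
combination of the observations with an expected value equal to $\boldsymbol{\lambda}'\boldsymbol{\beta}$; that is, $\boldsymbol{\lambda}'\boldsymbol{\beta}$ is
estimable if there exists a vector $\mathbf{a}$ such that $E(\mathbf{a}'\mathbf{y}) = \boldsymbol{\lambda}'\boldsymbol{\beta}$.

In the following theorem we consider three methods for determining whether a
particular linear function $\boldsymbol{\lambda}'\boldsymbol{\beta}$ is estimable.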


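Theorem 12.2b. In the model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$, where $E(\mathbf{y}) = \mathbf{X}\boldsymbol{\beta}$ and $\mathbf{X}$ is $n \times p$
of rank $k < p < n$, the linear function $\boldsymbol{\lambda}'\boldsymbol{\beta}$ is estimable if and only if any one of
the following equivalent conditions holds:


(i) $\boldsymbol{\lambda}'$ is a linear combination of the rows of $\mathbf{X}$; that is, there exists a vector $\mathbf{a}$ such
that

$$\mathbf{a}'\mathbf{X} = \boldsymbol{\lambda}'. \tag{12.15}$$

(ii) $\boldsymbol{\lambda}'$ is a linear combination of the rows of $\mathbf{X}'\mathbf{X}$ or $\boldsymbol{\lambda}$ is a linear combination of
the columns of $\mathbf{X}'\mathbf{X}$, that is, there exists a vector $\mathbf{r}$ such that

$$\mathbf{r}'\mathbf{X}'\mathbf{X} = \boldsymbol{\lambda}' \quad \text{or} \quad \mathbf{X}'\mathbf{X}\mathbf{r} = \boldsymbol{\lambda}. \tag{12.16}$$

(iii) $\boldsymbol{\lambda}$ or $\boldsymbol{\lambda}'$ is such that

$$\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\boldsymbol{\lambda} = \boldsymbol{\lambda} \quad \text{or} \quad \boldsymbol{\lambda}'(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{X} = \boldsymbol{\lambda}', \tag{12.17}$$

where $(\mathbf{X}'\mathbf{X})^{-}$ is any (symmetric) generalized inverse of $\mathbf{X}'\mathbf{X}$.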


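PROOF. For (ii) and (iii), we prove the "if" part. For (i), we prove both "if" and "only
if."


(i) If there exists a vector $\mathbf{a}$ such that $\boldsymbol{\lambda}' = \mathbf{a}'\mathbf{X}$, then, using this vector $\mathbf{a}$, we have

$$E(\mathbf{a}'\mathbf{y}) = \mathbf{a}'E(\mathbf{y}) = \mathbf{a}'\mathbf{X}\boldsymbol{\beta} = \boldsymbol{\lambda}'\boldsymbol{\beta}.$$

Conversely, if $\boldsymbol{\lambda}'\boldsymbol{\beta}$ is estimable, then there exists a vector $\mathbf{a}$ such that
$E(\mathbf{a}'\mathbf{y}) = \boldsymbol{\lambda}'\boldsymbol{\beta}$. Thus $\mathbf{a}'\mathbf{X}\boldsymbol{\beta} = \boldsymbol{\lambda}'\boldsymbol{\beta}$, which implies, among other things, that
$\mathbf{a}'\mathbf{X} = \boldsymbol{\lambda}'$.

(ii) If there exists a solution $\mathbf{r}$ for $\mathbf{X}'\mathbf{X}\mathbf{r} = \boldsymbol{\lambda}$, then, by defining $\mathbf{a} = \mathbf{X}\mathbf{r}$, we obtain

$$E(\mathbf{a}'\mathbf{y}) = E(\mathbf{r}'\mathbf{X}'\mathbf{y}) = \mathbf{r}'\mathbf{X}'E(\mathbf{y}) = \mathbf{r}'\mathbf{X}'\mathbf{X}\boldsymbol{\beta} = \boldsymbol{\lambda}'\boldsymbol{\beta}.$$

(iii) If $\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\boldsymbol{\lambda} = \boldsymbol{\lambda}$, then $(\mathbf{X}'\mathbf{X})^{-}\boldsymbol{\lambda}$ is a solution to $\mathbf{X}'\mathbf{X}\mathbf{r} = \boldsymbol{\lambda}$ in part (ii). (For
proof of the converse, see Problem 12.4.)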


We illustrate the use of Theorem 12.2b in the following example. 


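Example 12.2.2a. For the model $y_{ij} = \mu + \tau_i + \varepsilon_{ij}$; $i = 1, 2$; $j = 1, 2, 3$ in
Example 12.2.1, the matrix $\mathbf{X}$ and the vector $\boldsymbol{\beta}$ are given as

$$
\mathbf{X} = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \end{pmatrix},
\qquad
\boldsymbol{\beta} = \begin{pmatrix} \mu \\ \tau_1 \\ \tau_2 \end{pmatrix}.
$$

We noted in Section 12.1.1 that $\tau_1 - \tau_2$ is unique. We now show that $\tau_1 - \tau_2 =
(0, 1, -1)\boldsymbol{\beta} = \boldsymbol{\lambda}'\boldsymbol{\beta}$ is estimable, using all three conditions of Theorem 12.2b.


(i) To find a vector $\mathbf{a}$ such that $\mathbf{a}'\mathbf{X} = \boldsymbol{\lambda}' = (0, 1, -1)$, consider $\mathbf{a}' =
(0, 0, 1, -1, 0, 0)$, which gives

$$\mathbf{a}'\mathbf{X} = (0, 0, 1, -1, 0, 0)\mathbf{X} = (1, 1, 0) - (1, 0, 1) = (0, 1, -1) = \boldsymbol{\lambda}'.$$

There are many other choices for $\mathbf{a}$, of course, that will yield $\mathbf{a}'\mathbf{X} = \boldsymbol{\lambda}'$, for
example $\mathbf{a}' = (1, 0, 0, 0, 0, -1)$ or $\mathbf{a}' = (2, -1, 0, 0, 1, -2)$. Note that we can
likewise obtain $\boldsymbol{\lambda}'\boldsymbol{\beta}$ from $E(\mathbf{y})$:

$$
\begin{aligned}
\boldsymbol{\lambda}'\boldsymbol{\beta} &= \mathbf{a}'\mathbf{X}\boldsymbol{\beta} = \mathbf{a}'E(\mathbf{y}) = (0, 0, 1, -1, 0, 0)E(\mathbf{y}) \\
&= (0, 0, 1, -1, 0, 0)\begin{pmatrix} E(y_{11}) \\ E(y_{12}) \\ E(y_{13}) \\ E(y_{21}) \\ E(y_{22}) \\ E(y_{23}) \end{pmatrix} \\
&= E(y_{13}) - E(y_{21}) = \mu + \tau_1 - (\mu + \tau_2) = \tau_1 - \tau_2.
\end{aligned}
$$

(ii) The matrix $\mathbf{X}'\mathbf{X}$ is given in Example 12.2.1 as

$$\mathbf{X}'\mathbf{X} = \begin{pmatrix} 6 & 3 & 3 \\ 3 & 3 & 0 \\ 3 & 0 & 3 \end{pmatrix}.$$

To find a vector $\mathbf{r}$ such that $\mathbf{X}'\mathbf{X}\mathbf{r} = \boldsymbol{\lambda} = (0, 1, -1)'$, consider
$\mathbf{r} = (0, \tfrac{1}{3}, -\tfrac{1}{3})'$, which gives

$$
\mathbf{X}'\mathbf{X}\mathbf{r} = \begin{pmatrix} 6 & 3 & 3 \\ 3 & 3 & 0 \\ 3 & 0 & 3 \end{pmatrix}
\begin{pmatrix} 0 \\ \tfrac{1}{3} \\ -\tfrac{1}{3} \end{pmatrix}
= \begin{pmatrix} 0 \\ 1 \\ -1 \end{pmatrix} = \boldsymbol{\lambda}.
$$

There are other possible values of $\mathbf{r}$, of course, such as $\mathbf{r} = (-\tfrac{1}{3}, \tfrac{2}{3}, 0)'$.

(iii) Using the generalized inverse $(\mathbf{X}'\mathbf{X})^{-} = \text{diag}(0, \tfrac{1}{3}, \tfrac{1}{3})$ given in Example
12.2.1, the product $\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}$ becomes

$$\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-} = \begin{pmatrix} 0 & 1 & 1 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}.$$

Then, for $\boldsymbol{\lambda} = (0, 1, -1)'$, we see that $\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\boldsymbol{\lambda} = \boldsymbol{\lambda}$ in (12.17) holds:

$$
\begin{pmatrix} 0 & 1 & 1 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} 0 \\ 1 \\ -1 \end{pmatrix}
= \begin{pmatrix} 0 \\ 1 \\ -1 \end{pmatrix}.
$$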

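A set of functions $\boldsymbol{\lambda}_1'\boldsymbol{\beta}, \boldsymbol{\lambda}_2'\boldsymbol{\beta}, \ldots, \boldsymbol{\lambda}_m'\boldsymbol{\beta}$ is said to be linearly independent if the
coefficient vectors $\boldsymbol{\lambda}_1, \boldsymbol{\lambda}_2, \ldots, \boldsymbol{\lambda}_m$ are linearly independent [see (2.40)]. The
number of linearly independent estimable functions is given in the next theorem.


Theorem 12.2c. In the non-full-rank model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$, the number of linearly
independent estimable functions of $\boldsymbol{\beta}$ is the rank of $\mathbf{X}$.


PROOF. See Graybill (1976, pp. 485-486).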


From Theorem 12.2b(i), we see that $\mathbf{x}_i'\boldsymbol{\beta}$ is estimable for $i = 1, 2, \ldots, n$, where $\mathbf{x}_i'$
is the $i$th row of $\mathbf{X}$. Thus every row (element) of $\mathbf{X}\boldsymbol{\beta}$ is estimable, and $\mathbf{X}\boldsymbol{\beta}$ itself can be
said to be estimable. Likewise, from Theorem 12.2b(ii), every row (element) of $\mathbf{X}'\mathbf{X}\boldsymbol{\beta}$
is estimable, and $\mathbf{X}'\mathbf{X}\boldsymbol{\beta}$ is therefore estimable. Conversely, all estimable functions can
be obtained from $\mathbf{X}\boldsymbol{\beta}$ or $\mathbf{X}'\mathbf{X}\boldsymbol{\beta}$.

Thus we can examine linear combinations of the rows of $\mathbf{X}$ or of $\mathbf{X}'\mathbf{X}$ to see what
functions of the parameters are estimable. In the following example, we illustrate the
use of linear combinations of the rows of $\mathbf{X}$ to obtain a set of estimable functions of
the parameters.


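Example 12.2.2b. Consider the model in (12.6) in Section 12.1.2 with

$$
\mathbf{X} = \begin{pmatrix} 1 & 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 0 & 1 \\ 1 & 0 & 1 & 1 & 0 \\ 1 & 0 & 1 & 0 & 1 \end{pmatrix},
\qquad
\boldsymbol{\beta} = \begin{pmatrix} \mu \\ \alpha_1 \\ \alpha_2 \\ \beta_1 \\ \beta_2 \end{pmatrix}.
$$

To examine what is estimable, we take linear combinations $\mathbf{a}'\mathbf{X}$ of the rows of $\mathbf{X}$ to
obtain three linearly independent rows. For example, if we subtract the first row of $\mathbf{X}$
from the third row and multiply by $\boldsymbol{\beta}$, we obtain $(0\ {-1}\ 1\ 0\ 0)\boldsymbol{\beta} = -\alpha_1 + \alpha_2$, which
involves only the $\alpha$'s. Subtracting the first row of $\mathbf{X}$ from the third row can be
expressed as $\mathbf{a}'\mathbf{X} = (-1\ 0\ 1\ 0)\mathbf{X} = -\mathbf{x}_1' + \mathbf{x}_3'$, where $\mathbf{x}_1'$ and $\mathbf{x}_3'$ are the first and
third rows of $\mathbf{X}$.

Subtracting the first row from each succeeding row in $\mathbf{X}$ gives

$$
\begin{pmatrix} 1 & 1 & 0 & 1 & 0 \\ 0 & 0 & 0 & -1 & 1 \\ 0 & -1 & 1 & 0 & 0 \\ 0 & -1 & 1 & -1 & 1 \end{pmatrix}.
$$

Subtracting the second and third rows from the fourth row then yields

$$
\begin{pmatrix} 1 & 1 & 0 & 1 & 0 \\ 0 & 0 & 0 & -1 & 1 \\ 0 & -1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{pmatrix}.
$$

Multiplying the first three rows by $\boldsymbol{\beta}$, we obtain the three linearly independent
estimable functions

$$\boldsymbol{\lambda}_1'\boldsymbol{\beta} = \mu + \alpha_1 + \beta_1, \qquad \boldsymbol{\lambda}_2'\boldsymbol{\beta} = \beta_2 - \beta_1, \qquad \boldsymbol{\lambda}_3'\boldsymbol{\beta} = \alpha_2 - \alpha_1.$$

These functions are identical to the functions $\gamma_1$, $\gamma_2$, and $\gamma_3$ used in Section 12.1.2 to
reparameterize to a full-rank model. Thus, in that example, linearly independent
estimable functions of the parameters were used as the new parameters.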
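Condition (i) of Theorem 12.2b can also be checked numerically for this two-way model by asking whether $\mathbf{X}'\mathbf{a} = \boldsymbol{\lambda}$ has an exact solution. The sketch below uses least squares for this purpose; the helper name `in_row_space` is hypothetical.

```python
import numpy as np

X = np.array([[1, 1, 0, 1, 0],
              [1, 1, 0, 0, 1],
              [1, 0, 1, 1, 0],
              [1, 0, 1, 0, 1]], dtype=float)

def in_row_space(X, lam):
    """Return True if some vector a satisfies a'X = lam' (condition (i))."""
    a, *_ = np.linalg.lstsq(X.T, lam, rcond=None)   # least-squares solution of X'a = lam
    return bool(np.allclose(X.T @ a, lam))

lam1 = np.array([1.0, 1.0, 0.0, 1.0, 0.0])    # mu + alpha_1 + beta_1
lam2 = np.array([0.0, 0.0, 0.0, -1.0, 1.0])   # beta_2 - beta_1
lam3 = np.array([0.0, -1.0, 1.0, 0.0, 0.0])   # alpha_2 - alpha_1
alpha1 = np.array([0.0, 1.0, 0.0, 0.0, 0.0])  # alpha_1 alone

results = [in_row_space(X, v) for v in (lam1, lam2, lam3, alpha1)]
n_independent = np.linalg.matrix_rank(np.vstack([lam1, lam2, lam3]))

print(results)                                    # [True, True, True, False]
print(n_independent, np.linalg.matrix_rank(X))    # 3 3, as in Theorem 12.2c
```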


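In Example 12.2.2b, the two estimable functions $\beta_2 - \beta_1$ and $\alpha_2 - \alpha_1$ are such
that the coefficients of the $\beta$'s or of the $\alpha$'s sum to zero. A linear combination of
this type is called a contrast.


12.3 ESTIMATORS


12.3.1 Estimators of $\boldsymbol{\lambda}'\boldsymbol{\beta}$


From Theorem 12.2b(i) and (ii) we have the estimators $\mathbf{a}'\mathbf{y}$ and $\mathbf{r}'\mathbf{X}'\mathbf{y}$ for $\boldsymbol{\lambda}'\boldsymbol{\beta}$, where
$\mathbf{a}'$ and $\mathbf{r}'$ satisfy $\boldsymbol{\lambda}' = \mathbf{a}'\mathbf{X}$ and $\boldsymbol{\lambda}' = \mathbf{r}'\mathbf{X}'\mathbf{X}$, respectively. A third estimator of $\boldsymbol{\lambda}'\boldsymbol{\beta}$ is
$\boldsymbol{\lambda}'\hat{\boldsymbol{\beta}}$, where $\hat{\boldsymbol{\beta}}$ is a solution of $\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}'\mathbf{y}$. In the following theorem, we discuss
some properties of $\mathbf{r}'\mathbf{X}'\mathbf{y}$ and $\boldsymbol{\lambda}'\hat{\boldsymbol{\beta}}$. We do not discuss the estimator $\mathbf{a}'\mathbf{y}$ because it
is not guaranteed to have minimum variance (see Theorem 12.3d).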


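Theorem 12.3a. Let $\boldsymbol{\lambda}'\boldsymbol{\beta}$ be an estimable function of $\boldsymbol{\beta}$ in the model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$,
where $E(\mathbf{y}) = \mathbf{X}\boldsymbol{\beta}$ and $\mathbf{X}$ is $n \times p$ of rank $k < p < n$. Let $\hat{\boldsymbol{\beta}}$ be any solution to the
normal equations $\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}'\mathbf{y}$, and let $\mathbf{r}$ be any solution to $\mathbf{X}'\mathbf{X}\mathbf{r} = \boldsymbol{\lambda}$. Then the two
estimators $\boldsymbol{\lambda}'\hat{\boldsymbol{\beta}}$ and $\mathbf{r}'\mathbf{X}'\mathbf{y}$ have the following properties:


(i) $E(\boldsymbol{\lambda}'\hat{\boldsymbol{\beta}}) = E(\mathbf{r}'\mathbf{X}'\mathbf{y}) = \boldsymbol{\lambda}'\boldsymbol{\beta}$.
(ii) $\boldsymbol{\lambda}'\hat{\boldsymbol{\beta}}$ is equal to $\mathbf{r}'\mathbf{X}'\mathbf{y}$ for any $\hat{\boldsymbol{\beta}}$ or any $\mathbf{r}$.
(iii) $\boldsymbol{\lambda}'\hat{\boldsymbol{\beta}}$ and $\mathbf{r}'\mathbf{X}'\mathbf{y}$ are invariant to the choice of $\hat{\boldsymbol{\beta}}$ or $\mathbf{r}$.


PROOF


(i) By (12.14),

$$E(\boldsymbol{\lambda}'\hat{\boldsymbol{\beta}}) = \boldsymbol{\lambda}'E(\hat{\boldsymbol{\beta}}) = \boldsymbol{\lambda}'(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{X}\boldsymbol{\beta}.$$

By Theorem 12.2b(iii), $\boldsymbol{\lambda}'(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{X} = \boldsymbol{\lambda}'$, and $E(\boldsymbol{\lambda}'\hat{\boldsymbol{\beta}})$ becomes

$$E(\boldsymbol{\lambda}'\hat{\boldsymbol{\beta}}) = \boldsymbol{\lambda}'\boldsymbol{\beta}.$$

By Theorem 12.2b(ii),

$$E(\mathbf{r}'\mathbf{X}'\mathbf{y}) = \mathbf{r}'\mathbf{X}'E(\mathbf{y}) = \mathbf{r}'\mathbf{X}'\mathbf{X}\boldsymbol{\beta} = \boldsymbol{\lambda}'\boldsymbol{\beta}.$$

(ii) By Theorem 12.2b(ii), if $\boldsymbol{\lambda}'\boldsymbol{\beta}$ is estimable, $\boldsymbol{\lambda}' = \mathbf{r}'\mathbf{X}'\mathbf{X}$ for some $\mathbf{r}$. Multiplying
the normal equations $\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}'\mathbf{y}$ by $\mathbf{r}'$ gives

$$\mathbf{r}'\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{r}'\mathbf{X}'\mathbf{y}.$$

Since $\mathbf{r}'\mathbf{X}'\mathbf{X} = \boldsymbol{\lambda}'$, we have

$$\boldsymbol{\lambda}'\hat{\boldsymbol{\beta}} = \mathbf{r}'\mathbf{X}'\mathbf{y}.$$

(iii) To show that $\mathbf{r}'\mathbf{X}'\mathbf{y}$ is invariant to the choice of $\mathbf{r}$, let $\mathbf{r}_1$ and $\mathbf{r}_2$ be such that
$\mathbf{X}'\mathbf{X}\mathbf{r}_1 = \mathbf{X}'\mathbf{X}\mathbf{r}_2 = \boldsymbol{\lambda}$. Then

$$\mathbf{r}_1'\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{r}_1'\mathbf{X}'\mathbf{y} \quad \text{and} \quad \mathbf{r}_2'\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{r}_2'\mathbf{X}'\mathbf{y}.$$

Since $\mathbf{r}_1'\mathbf{X}'\mathbf{X} = \mathbf{r}_2'\mathbf{X}'\mathbf{X}$, we have $\mathbf{r}_1'\mathbf{X}'\mathbf{y} = \mathbf{r}_2'\mathbf{X}'\mathbf{y}$. It is clear that each is equal to
$\boldsymbol{\lambda}'\hat{\boldsymbol{\beta}}$. (For a direct proof that $\boldsymbol{\lambda}'\hat{\boldsymbol{\beta}}$ is invariant to the choice of $\hat{\boldsymbol{\beta}}$, see Problem 12.6.)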


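We illustrate the estimators $\mathbf{r}'\mathbf{X}'\mathbf{y}$ and $\boldsymbol{\lambda}'\hat{\boldsymbol{\beta}}$ in the following example.


Example 12.3.1. The linear function $\boldsymbol{\lambda}'\boldsymbol{\beta} = \tau_1 - \tau_2$ was shown to be estimable in
Example 12.2.2a. To estimate $\tau_1 - \tau_2$ with $\mathbf{r}'\mathbf{X}'\mathbf{y}$, we use $\mathbf{r}' = (0, \tfrac{1}{3}, -\tfrac{1}{3})$ from
Example 12.2.2a to obtain

$$
\begin{aligned}
\mathbf{r}'\mathbf{X}'\mathbf{y} &= (0, \tfrac{1}{3}, -\tfrac{1}{3})
\begin{pmatrix} 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 1 & 1 \end{pmatrix}
\begin{pmatrix} y_{11} \\ y_{12} \\ y_{13} \\ y_{21} \\ y_{22} \\ y_{23} \end{pmatrix} \\
&= (0, \tfrac{1}{3}, -\tfrac{1}{3}) \begin{pmatrix} y_{..} \\ y_{1.} \\ y_{2.} \end{pmatrix}
= \tfrac{1}{3}y_{1.} - \tfrac{1}{3}y_{2.} = \bar{y}_{1.} - \bar{y}_{2.},
\end{aligned}
$$

where $y_{..} = \sum_{i=1}^{2}\sum_{j=1}^{3} y_{ij}$, $y_{i.} = \sum_{j=1}^{3} y_{ij}$, and $\bar{y}_{i.} = y_{i.}/3 = \sum_{j=1}^{3} y_{ij}/3$.

To obtain the same result using $\boldsymbol{\lambda}'\hat{\boldsymbol{\beta}}$, we first find a solution to the normal
equations $\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}'\mathbf{y}$

$$
\begin{pmatrix} 6 & 3 & 3 \\ 3 & 3 & 0 \\ 3 & 0 & 3 \end{pmatrix}
\begin{pmatrix} \hat{\mu} \\ \hat{\tau}_1 \\ \hat{\tau}_2 \end{pmatrix}
= \begin{pmatrix} y_{..} \\ y_{1.} \\ y_{2.} \end{pmatrix}
$$

or

$$
\begin{aligned}
6\hat{\mu} + 3\hat{\tau}_1 + 3\hat{\tau}_2 &= y_{..}, \\
3\hat{\mu} + 3\hat{\tau}_1 &= y_{1.}, \\
3\hat{\mu} + 3\hat{\tau}_2 &= y_{2.}.
\end{aligned}
$$

The first equation is redundant since it is the sum of the second and third equations.
We can take $\hat{\mu}$ to be an arbitrary constant and obtain

$$\hat{\tau}_1 = \tfrac{1}{3}y_{1.} - \hat{\mu} = \bar{y}_{1.} - \hat{\mu}, \qquad \hat{\tau}_2 = \tfrac{1}{3}y_{2.} - \hat{\mu} = \bar{y}_{2.} - \hat{\mu}.$$

Thus

$$\hat{\boldsymbol{\beta}} = \begin{pmatrix} \hat{\mu} \\ \hat{\tau}_1 \\ \hat{\tau}_2 \end{pmatrix} = \begin{pmatrix} \hat{\mu} \\ \bar{y}_{1.} - \hat{\mu} \\ \bar{y}_{2.} - \hat{\mu} \end{pmatrix}.$$

To estimate $\tau_1 - \tau_2 = (0, 1, -1)\boldsymbol{\beta} = \boldsymbol{\lambda}'\boldsymbol{\beta}$, we can set $\hat{\mu} = 0$ to obtain
$\hat{\boldsymbol{\beta}} = (0, \bar{y}_{1.}, \bar{y}_{2.})'$ and $\boldsymbol{\lambda}'\hat{\boldsymbol{\beta}} = \bar{y}_{1.} - \bar{y}_{2.}$. If we leave $\hat{\mu}$ arbitrary, we likewise obtain

$$
\boldsymbol{\lambda}'\hat{\boldsymbol{\beta}} = (0, 1, -1)\begin{pmatrix} \hat{\mu} \\ \bar{y}_{1.} - \hat{\mu} \\ \bar{y}_{2.} - \hat{\mu} \end{pmatrix}
= \bar{y}_{1.} - \hat{\mu} - (\bar{y}_{2.} - \hat{\mu}) = \bar{y}_{1.} - \bar{y}_{2.}.
$$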

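Since $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{y}$ is not unique for the non-full-rank model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ with
cov$(\mathbf{y}) = \sigma^2\mathbf{I}$, it does not have a unique covariance matrix. However, for a particular
(symmetric) generalized inverse $(\mathbf{X}'\mathbf{X})^{-}$, we can use Theorem 3.6d(i) to obtain the
following covariance matrix:

$$
\begin{aligned}
\text{cov}(\hat{\boldsymbol{\beta}}) &= \text{cov}[(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{y}] \\
&= (\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'(\sigma^2\mathbf{I})\mathbf{X}[(\mathbf{X}'\mathbf{X})^{-}]' \\
&= \sigma^2(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}.
\end{aligned}
\tag{12.18}
$$

The expression in (12.18) is not invariant to the choice of $(\mathbf{X}'\mathbf{X})^{-}$.
The variance of $\boldsymbol{\lambda}'\hat{\boldsymbol{\beta}}$ or of $\mathbf{r}'\mathbf{X}'\mathbf{y}$ is given in the following theorem.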


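Theorem 12.3b. Let $\boldsymbol\lambda'\boldsymbol\beta$ be an estimable function in the model $\mathbf{y} = \mathbf{X}\boldsymbol\beta + \boldsymbol\varepsilon$, where $\mathbf{X}$ is $n \times p$ of rank $k < p \le n$ and $\operatorname{cov}(\mathbf{y}) = \sigma^2\mathbf{I}$. Let $\mathbf{r}$ be any solution to $\mathbf{X}'\mathbf{X}\mathbf{r} = \boldsymbol\lambda$, and let $\hat{\boldsymbol\beta}$ be any solution to $\mathbf{X}'\mathbf{X}\hat{\boldsymbol\beta} = \mathbf{X}'\mathbf{y}$. Then the variance of $\boldsymbol\lambda'\hat{\boldsymbol\beta}$ or $\mathbf{r}'\mathbf{X}'\mathbf{y}$ has the following properties:

(i) $\operatorname{var}(\mathbf{r}'\mathbf{X}'\mathbf{y}) = \sigma^2\mathbf{r}'\mathbf{X}'\mathbf{X}\mathbf{r} = \sigma^2\mathbf{r}'\boldsymbol\lambda$.
(ii) $\operatorname{var}(\boldsymbol\lambda'\hat{\boldsymbol\beta}) = \sigma^2\boldsymbol\lambda'(\mathbf{X}'\mathbf{X})^{-}\boldsymbol\lambda$.
(iii) $\operatorname{var}(\boldsymbol\lambda'\hat{\boldsymbol\beta})$ is unique, that is, invariant to the choice of $\mathbf{r}$ or $(\mathbf{X}'\mathbf{X})^{-}$.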


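PROOF

(i) $$\begin{aligned}
\operatorname{var}(\mathbf{r}'\mathbf{X}'\mathbf{y}) &= \mathbf{r}'\mathbf{X}'\operatorname{cov}(\mathbf{y})\mathbf{X}\mathbf{r} && \text{[by (3.42)]} \\
&= \mathbf{r}'\mathbf{X}'(\sigma^2\mathbf{I})\mathbf{X}\mathbf{r} = \sigma^2\mathbf{r}'\mathbf{X}'\mathbf{X}\mathbf{r} \\
&= \sigma^2\mathbf{r}'\boldsymbol\lambda && \text{[by (12.16)]}.
\end{aligned}$$

(ii) $$\begin{aligned}
\operatorname{var}(\boldsymbol\lambda'\hat{\boldsymbol\beta}) &= \boldsymbol\lambda'\operatorname{cov}(\hat{\boldsymbol\beta})\boldsymbol\lambda \\
&= \sigma^2\boldsymbol\lambda'(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\boldsymbol\lambda && \text{[by (12.18)]}.
\end{aligned}$$

By (12.17), $\boldsymbol\lambda'(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{X} = \boldsymbol\lambda'$, and therefore

$$\operatorname{var}(\boldsymbol\lambda'\hat{\boldsymbol\beta}) = \sigma^2\boldsymbol\lambda'(\mathbf{X}'\mathbf{X})^{-}\boldsymbol\lambda.$$

(iii) To show that $\mathbf{r}'\boldsymbol\lambda$ is invariant to $\mathbf{r}$, let $\mathbf{r}_1$ and $\mathbf{r}_2$ be such that $\mathbf{X}'\mathbf{X}\mathbf{r}_1 = \boldsymbol\lambda$ and $\mathbf{X}'\mathbf{X}\mathbf{r}_2 = \boldsymbol\lambda$. Multiplying these two equations by $\mathbf{r}_2'$ and $\mathbf{r}_1'$, respectively, we obtain

$$\mathbf{r}_2'\mathbf{X}'\mathbf{X}\mathbf{r}_1 = \mathbf{r}_2'\boldsymbol\lambda \quad\text{and}\quad \mathbf{r}_1'\mathbf{X}'\mathbf{X}\mathbf{r}_2 = \mathbf{r}_1'\boldsymbol\lambda.$$

The left sides of these two equations are equal since they are scalars and are transposes of each other. Therefore the right sides are also equal:

$$\mathbf{r}_1'\boldsymbol\lambda = \mathbf{r}_2'\boldsymbol\lambda.$$

To show that $\boldsymbol\lambda'(\mathbf{X}'\mathbf{X})^{-}\boldsymbol\lambda$ is invariant to the choice of $(\mathbf{X}'\mathbf{X})^{-}$, let $\mathbf{G}_1$ and $\mathbf{G}_2$ be two generalized inverses of $\mathbf{X}'\mathbf{X}$. Then by Theorem 2.8c(v), we have

$$\mathbf{X}\mathbf{G}_1\mathbf{X}' = \mathbf{X}\mathbf{G}_2\mathbf{X}'.$$

Multiplying both sides by $\mathbf{a}$ such that $\mathbf{a}'\mathbf{X} = \boldsymbol\lambda'$ [see Theorem 12.2b(i)], we obtain

$$\begin{aligned}
\mathbf{a}'\mathbf{X}\mathbf{G}_1\mathbf{X}'\mathbf{a} &= \mathbf{a}'\mathbf{X}\mathbf{G}_2\mathbf{X}'\mathbf{a}, \\
\boldsymbol\lambda'\mathbf{G}_1\boldsymbol\lambda &= \boldsymbol\lambda'\mathbf{G}_2\boldsymbol\lambda.
\end{aligned}$$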
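Part (iii) can be illustrated numerically for the one-way layout of Example 12.3.1. The sketch below is an illustration under stated assumptions rather than anything from the text: the two generalized inverses `G1` and `G2` are chosen by hand (they correspond to setting $\hat\mu = 0$ and $\hat\tau_2 = 0$, respectively), and the helper names are hypothetical.

```python
from fractions import Fraction as F


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


# X'X for y_ij = mu + tau_i + e_ij with i = 1, 2 and j = 1, 2, 3.
XtX = [[F(6), F(3), F(3)], [F(3), F(3), F(0)], [F(3), F(0), F(3)]]

# Two different symmetric generalized inverses of X'X (hand-picked for illustration).
G1 = [[F(0), F(0), F(0)], [F(0), F(1, 3), F(0)], [F(0), F(0), F(1, 3)]]
G2 = [[F(1, 3), F(-1, 3), F(0)], [F(-1, 3), F(2, 3), F(0)], [F(0), F(0), F(0)]]


def is_generalized_inverse(A, G):
    """Check the defining property A G A = A."""
    return matmul(matmul(A, G), A) == A


def quad_form(lam, G):
    """Return lam' G lam, which equals var(lam' beta_hat) / sigma^2."""
    row = matmul([lam], G)
    return matmul(row, [[v] for v in lam])[0][0]


# Estimable function tau_1 - tau_2: the quadratic form does not depend on G.
lam = [F(0), F(1), F(-1)]
v1, v2 = quad_form(lam, G1), quad_form(lam, G2)

# Nonestimable function mu: the quadratic form changes with G.
mu_only = [F(1), F(0), F(0)]
u1, u2 = quad_form(mu_only, G1), quad_form(mu_only, G2)
```

For the estimable contrast both generalized inverses give $2/3$, so $\operatorname{var}(\boldsymbol\lambda'\hat{\boldsymbol\beta}) = 2\sigma^2/3$, while the nonestimable function $\mu$ gives different values under `G1` and `G2`.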


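The covariance of the estimators of two estimable functions is given in the following theorem.


Theorem 12.3c. If $\boldsymbol\lambda_1'\boldsymbol\beta$ and $\boldsymbol\lambda_2'\boldsymbol\beta$ are two estimable functions in the model $\mathbf{y} = \mathbf{X}\boldsymbol\beta + \boldsymbol\varepsilon$, where $\mathbf{X}$ is $n \times p$ of rank $k < p \le n$ and $\operatorname{cov}(\mathbf{y}) = \sigma^2\mathbf{I}$, the covariance of their estimators is given by

$$\operatorname{cov}(\boldsymbol\lambda_1'\hat{\boldsymbol\beta}, \boldsymbol\lambda_2'\hat{\boldsymbol\beta}) = \sigma^2\mathbf{r}_1'\boldsymbol\lambda_2 = \sigma^2\boldsymbol\lambda_1'\mathbf{r}_2 = \sigma^2\boldsymbol\lambda_1'(\mathbf{X}'\mathbf{X})^{-}\boldsymbol\lambda_2,$$

where $\mathbf{X}'\mathbf{X}\mathbf{r}_1 = \boldsymbol\lambda_1$ and $\mathbf{X}'\mathbf{X}\mathbf{r}_2 = \boldsymbol\lambda_2$.


PROOF. See Problem 12.12.


The estimators $\boldsymbol\lambda'\hat{\boldsymbol\beta}$ and $\mathbf{r}'\mathbf{X}'\mathbf{y}$ have an optimality property analogous to that in Corollary 1 to Theorem 7.3d.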




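Theorem 12.3d. If $\boldsymbol\lambda'\boldsymbol\beta$ is an estimable function in the model $\mathbf{y} = \mathbf{X}\boldsymbol\beta + \boldsymbol\varepsilon$, where $\mathbf{X}$ is $n \times p$ of rank $k < p \le n$, then the estimators $\boldsymbol\lambda'\hat{\boldsymbol\beta}$ and $\mathbf{r}'\mathbf{X}'\mathbf{y}$ are BLUE.


PROOF. Let a linear estimator of $\boldsymbol\lambda'\boldsymbol\beta$ be denoted by $\mathbf{a}'\mathbf{y}$, where without loss of generality $\mathbf{a}'\mathbf{y} = \mathbf{r}'\mathbf{X}'\mathbf{y} + \mathbf{c}'\mathbf{y}$, that is, $\mathbf{a}' = \mathbf{r}'\mathbf{X}' + \mathbf{c}'$, where $\mathbf{r}'$ is a solution to $\boldsymbol\lambda' = \mathbf{r}'\mathbf{X}'\mathbf{X}$. For unbiasedness we must have

$$\boldsymbol\lambda'\boldsymbol\beta = E(\mathbf{a}'\mathbf{y}) = \mathbf{a}'\mathbf{X}\boldsymbol\beta = \mathbf{r}'\mathbf{X}'\mathbf{X}\boldsymbol\beta + \mathbf{c}'\mathbf{X}\boldsymbol\beta = (\mathbf{r}'\mathbf{X}'\mathbf{X} + \mathbf{c}'\mathbf{X})\boldsymbol\beta.$$

This must hold for all $\boldsymbol\beta$, and we therefore have

$$\boldsymbol\lambda' = \mathbf{r}'\mathbf{X}'\mathbf{X} + \mathbf{c}'\mathbf{X}.$$

Since $\boldsymbol\lambda' = \mathbf{r}'\mathbf{X}'\mathbf{X}$, it follows that $\mathbf{c}'\mathbf{X} = \mathbf{0}'$. Using (3.42) and $\mathbf{c}'\mathbf{X} = \mathbf{0}'$, we obtain

$$\begin{aligned}
\operatorname{var}(\mathbf{a}'\mathbf{y}) &= \mathbf{a}'\operatorname{cov}(\mathbf{y})\mathbf{a} = \mathbf{a}'\sigma^2\mathbf{I}\mathbf{a} = \sigma^2\mathbf{a}'\mathbf{a} \\
&= \sigma^2(\mathbf{r}'\mathbf{X}' + \mathbf{c}')(\mathbf{X}\mathbf{r} + \mathbf{c}) \\
&= \sigma^2(\mathbf{r}'\mathbf{X}'\mathbf{X}\mathbf{r} + \mathbf{r}'\mathbf{X}'\mathbf{c} + \mathbf{c}'\mathbf{X}\mathbf{r} + \mathbf{c}'\mathbf{c}) \\
&= \sigma^2(\mathbf{r}'\mathbf{X}'\mathbf{X}\mathbf{r} + \mathbf{c}'\mathbf{c}).
\end{aligned}$$

Therefore, to minimize $\operatorname{var}(\mathbf{a}'\mathbf{y})$, we must minimize $\mathbf{c}'\mathbf{c} = \sum_i c_i^2$. This is a minimum when $\mathbf{c} = \mathbf{0}$, which is compatible with $\mathbf{c}'\mathbf{X} = \mathbf{0}'$. Hence $\mathbf{a}'$ is equal to $\mathbf{r}'\mathbf{X}'$, and the BLUE for the estimable function $\boldsymbol\lambda'\boldsymbol\beta$ is $\mathbf{a}'\mathbf{y} = \mathbf{r}'\mathbf{X}'\mathbf{y}$.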


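12.3.2 Estimation of $\sigma^2$


By analogy with (7.23), we define

$$\mathrm{SSE} = (\mathbf{y} - \mathbf{X}\hat{\boldsymbol\beta})'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol\beta}), \tag{12.19}$$

where $\hat{\boldsymbol\beta}$ is any solution to the normal equations $\mathbf{X}'\mathbf{X}\hat{\boldsymbol\beta} = \mathbf{X}'\mathbf{y}$. Two alternative expressions for SSE are

$$\mathrm{SSE} = \mathbf{y}'\mathbf{y} - \hat{\boldsymbol\beta}'\mathbf{X}'\mathbf{y}, \tag{12.20}$$

$$\mathrm{SSE} = \mathbf{y}'[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}']\mathbf{y}. \tag{12.21}$$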




For an estimator of $\sigma^2$, we define

$$s^2 = \frac{\mathrm{SSE}}{n - k}, \tag{12.22}$$

where $n$ is the number of rows of $\mathbf{X}$ and $k = \operatorname{rank}(\mathbf{X})$.
Two properties of $s^2$ are given in the following theorem.


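Theorem 12.3e. For $s^2$ defined in (12.22) for the non-full-rank model, we have the following properties:

(i) $E(s^2) = \sigma^2$.
(ii) $s^2$ is invariant to the choice of $\hat{\boldsymbol\beta}$ or to the choice of generalized inverse $(\mathbf{X}'\mathbf{X})^{-}$.

PROOF

(i) Using (12.21), we have $E(\mathrm{SSE}) = E\{\mathbf{y}'[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}']\mathbf{y}\}$. By Theorem 5.2a, this becomes

$$E(\mathrm{SSE}) = \operatorname{tr}\{[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'](\sigma^2\mathbf{I})\} + \boldsymbol\beta'\mathbf{X}'[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}']\mathbf{X}\boldsymbol\beta.$$

It can readily be shown that the second term on the right side vanishes. For the first term, we have, by Theorem 2.11(i), (ii), and (viii),

$$\sigma^2\operatorname{tr}[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'] = \sigma^2\{\operatorname{tr}(\mathbf{I}) - \operatorname{tr}[\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}]\} = (n - k)\sigma^2,$$

where $k = \operatorname{rank}(\mathbf{X}'\mathbf{X}) = \operatorname{rank}(\mathbf{X})$.

(ii) Since $\mathbf{X}\boldsymbol\beta$ is estimable, $\mathbf{X}\hat{\boldsymbol\beta}$ is invariant to $\hat{\boldsymbol\beta}$ [see Theorem 12.3a(iii)], and therefore $\mathrm{SSE} = (\mathbf{y} - \mathbf{X}\hat{\boldsymbol\beta})'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol\beta})$ in (12.19) is invariant. To show that SSE in (12.21) is invariant to the choice of $(\mathbf{X}'\mathbf{X})^{-}$, we note that $\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'$ is invariant by Theorem 2.8c(v).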
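The invariance in Theorem 12.3e(ii) can be sketched numerically. This example assumes NumPy is available and uses made-up data for the one-way layout with two groups of three observations; `G2` is a hand-picked generalized inverse, and `sse` is a hypothetical helper name.

```python
import numpy as np

# Design matrix for y_ij = mu + tau_i + e_ij, i = 1, 2, j = 1, 2, 3 (rank 2, p = 3).
X = np.array([[1, 1, 0]] * 3 + [[1, 0, 1]] * 3, dtype=float)
y = np.array([5.0, 7.0, 9.0, 2.0, 4.0, 3.0])  # hypothetical observations

n, p = X.shape
k = np.linalg.matrix_rank(X)

XtX = X.T @ X
G1 = np.linalg.pinv(XtX)            # Moore-Penrose inverse, one generalized inverse
G2 = np.diag([0.0, 1 / 3, 1 / 3])   # another generalized inverse (sets mu_hat = 0)


def sse(G):
    """SSE = (y - X b)'(y - X b) for the solution b = G X'y of the normal equations."""
    beta_hat = G @ X.T @ y
    resid = y - X @ beta_hat
    return float(resid @ resid)


sse1, sse2 = sse(G1), sse(G2)
s2 = sse1 / (n - k)  # unbiased estimator (12.22), with divisor n - k rather than n - p
```

The two solutions $\hat{\boldsymbol\beta}$ differ, but both give the same SSE, and the divisor uses $k = \operatorname{rank}(\mathbf{X})$.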


12.3.3 Normal Model 


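For the non-full-rank model $\mathbf{y} = \mathbf{X}\boldsymbol\beta + \boldsymbol\varepsilon$, we now assume that

$$\mathbf{y} \text{ is } N_n(\mathbf{X}\boldsymbol\beta, \sigma^2\mathbf{I}) \quad\text{or}\quad \boldsymbol\varepsilon \text{ is } N_n(\mathbf{0}, \sigma^2\mathbf{I}).$$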


With the normality assumption we can obtain maximum likelihood estimators. 




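Theorem 12.3f. If $\mathbf{y}$ is $N_n(\mathbf{X}\boldsymbol\beta, \sigma^2\mathbf{I})$, where $\mathbf{X}$ is $n \times p$ of rank $k < p \le n$, then the maximum likelihood estimators for $\boldsymbol\beta$ and $\sigma^2$ are given by

$$\hat{\boldsymbol\beta} = (\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{y}, \tag{12.23}$$

$$\hat\sigma^2 = \frac{1}{n}(\mathbf{y} - \mathbf{X}\hat{\boldsymbol\beta})'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol\beta}). \tag{12.24}$$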


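PROOF. For the non-full-rank model, the likelihood function $L(\boldsymbol\beta, \sigma^2)$ and its logarithm $\ln L(\boldsymbol\beta, \sigma^2)$ can be written in the same form as those for the full-rank model in (7.50) and (7.51):

$$L(\boldsymbol\beta, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}}\, e^{-(\mathbf{y} - \mathbf{X}\boldsymbol\beta)'(\mathbf{y} - \mathbf{X}\boldsymbol\beta)/2\sigma^2}, \tag{12.25}$$

$$\ln L(\boldsymbol\beta, \sigma^2) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}(\mathbf{y} - \mathbf{X}\boldsymbol\beta)'(\mathbf{y} - \mathbf{X}\boldsymbol\beta). \tag{12.26}$$

Differentiating $\ln L(\boldsymbol\beta, \sigma^2)$ with respect to $\boldsymbol\beta$ and $\sigma^2$ and setting the results equal to zero gives

$$\mathbf{X}'\mathbf{X}\hat{\boldsymbol\beta} = \mathbf{X}'\mathbf{y}, \tag{12.27}$$

$$\hat\sigma^2 = \frac{1}{n}(\mathbf{y} - \mathbf{X}\hat{\boldsymbol\beta})'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol\beta}), \tag{12.28}$$

where $\hat{\boldsymbol\beta}$ in (12.28) is any solution to (12.27). If $(\mathbf{X}'\mathbf{X})^{-}$ is any generalized inverse of $\mathbf{X}'\mathbf{X}$, a solution to (12.27) is given by

$$\hat{\boldsymbol\beta} = (\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{y}. \tag{12.29}$$

The form of the maximum likelihood estimator $\hat{\boldsymbol\beta}$ in (12.29) is the same as that of the least-squares estimator in (12.13). The estimator $\hat\sigma^2$ is biased. We often use the unbiased estimator $s^2$ given in (12.22).

The mean vector and covariance matrix for $\hat{\boldsymbol\beta}$ are given in (12.14) and (12.18) as $E(\hat{\boldsymbol\beta}) = (\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{X}\boldsymbol\beta$ and $\operatorname{cov}(\hat{\boldsymbol\beta}) = \sigma^2(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}$. In the next theorem, we give some additional properties of $\hat{\boldsymbol\beta}$ and $s^2$. Note that some of these follow because $\hat{\boldsymbol\beta} = (\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{y}$ is a linear function of the observations.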




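Theorem 12.3g. If $\mathbf{y}$ is $N_n(\mathbf{X}\boldsymbol\beta, \sigma^2\mathbf{I})$, where $\mathbf{X}$ is $n \times p$ of rank $k < p \le n$, then the maximum likelihood estimators $\hat{\boldsymbol\beta}$ and $s^2$ (corrected for bias) have the following properties:

(i) $\hat{\boldsymbol\beta}$ is $N_p[(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{X}\boldsymbol\beta,\ \sigma^2(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}]$.
(ii) $(n - k)s^2/\sigma^2$ is $\chi^2(n - k)$.
(iii) $\hat{\boldsymbol\beta}$ and $s^2$ are independent.


PROOF. Adapting the proof of Theorem 7.6b for the non-full-rank case yields the desired results.


The expected value, covariance matrix, and distribution of $\hat{\boldsymbol\beta}$ in Theorem 12.3g are valid only for a particular value of $(\mathbf{X}'\mathbf{X})^{-}$, whereas $s^2$ is invariant to the choice of $\hat{\boldsymbol\beta}$ or $(\mathbf{X}'\mathbf{X})^{-}$ [see Theorem 12.3e(ii)].


The following theorem is an adaptation of Corollary 1 to Theorem 7.6d.


Theorem 12.3h. If $\mathbf{y}$ is $N_n(\mathbf{X}\boldsymbol\beta, \sigma^2\mathbf{I})$, where $\mathbf{X}$ is $n \times p$ of rank $k < p \le n$, and if $\boldsymbol\lambda'\boldsymbol\beta$ is an estimable function, then $\boldsymbol\lambda'\hat{\boldsymbol\beta}$ has minimum variance among all unbiased estimators.


In Theorem 12.3d, the estimator $\boldsymbol\lambda'\hat{\boldsymbol\beta}$ was shown to have minimum variance among all linear unbiased estimators. With the normality assumption added in Theorem 12.3g, $\boldsymbol\lambda'\hat{\boldsymbol\beta}$ has minimum variance among all unbiased estimators.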


12.4 GEOMETRY OF LEAST-SQUARES IN THE 
OVERPARAMETERIZED MODEL 


The geometric approach to least-squares in the overparameterized model is similar to that for the full-rank model (Section 7.4), but there are crucial differences. The approach involves two spaces, a $p$-dimensional parameter space and an $n$-dimensional data space. The unknown parameter vector $\boldsymbol\beta$ is an element of the parameter space with axes corresponding to the coefficients, and the known data vector $\mathbf{y}$ is an element of the data space with axes corresponding to the observations (Fig. 12.1).

The $n \times p$ partitioned $\mathbf{X}$ matrix of the overparameterized linear model (Section 12.2.1) is

$$\mathbf{X} = (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_p).$$


The columns of $\mathbf{X}$ are vectors in the data space, but since $\operatorname{rank}(\mathbf{X}) = k < p$, the set of vectors is not linearly independent. Nonetheless, the set of all possible linear combinations of these column vectors constitutes the prediction space. The distinctive geometric characteristic of the overparameterized model is that the prediction space is of dimension $k < p$ while the parameter space is of dimension $p$. Thus the product $\mathbf{X}\mathbf{u}$, where $\mathbf{u}$ is any vector in the parameter space, defines a many-to-one relationship between the parameter space and the prediction space (Fig. 12.1). An infinite number of vectors in the parameter space correspond to any particular vector in the prediction space.

[Figure 12.1 A geometric view of least-squares estimation in the overparameterized model. Labeled regions: parameter space, data space, prediction space.]

As was the case for the full-rank linear model, the overparameterized linear model states that $\mathbf{y}$ is equal to a vector in the prediction space, $E(\mathbf{y}) = \mathbf{X}\boldsymbol\beta$, plus a vector of random errors $\boldsymbol\varepsilon$. Neither $\boldsymbol\beta$ nor $\boldsymbol\varepsilon$ is known. Geometrically, least-squares estimation for the overparameterized model is the process of finding a sensible guess of $E(\mathbf{y})$ in the prediction space and then determining the subset of the parameter space that is associated with this guess (Fig. 12.1).

As in the full-rank model, a reasonable geometric idea is to estimate $E(\mathbf{y})$ using $\hat{\mathbf{y}}$, the unique point in the prediction space that is closest to $\mathbf{y}$. This implies that the difference vector $\hat{\boldsymbol\varepsilon} = \mathbf{y} - \hat{\mathbf{y}}$ must be orthogonal to the prediction space, and thus we seek $\hat{\mathbf{y}}$ such that

$$\mathbf{X}'\hat{\boldsymbol\varepsilon} = \mathbf{0},$$

which leads to the normal equations

$$\mathbf{X}'\mathbf{X}\hat{\boldsymbol\beta} = \mathbf{X}'\mathbf{y}.$$

However, these equations do not have a single solution since $\mathbf{X}'\mathbf{X}$ is not full-rank. Using Theorem 2.8e(ii), all possible solutions to this system of equations are given by $\hat{\boldsymbol\beta} = (\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{y}$ using all possible values of $(\mathbf{X}'\mathbf{X})^{-}$. These solutions constitute an infinite subset of the parameter space (Fig. 12.1), but this subset is not a subspace.




Since the solutions are infinite in number, none of the $\hat{\boldsymbol\beta}$ values themselves have any meaning. Nonetheless, $\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol\beta}$ is unique [see Theorem 2.8c(v)], and therefore, to be unambiguous, all further inferences must be restricted to linear functions of $\mathbf{X}\hat{\boldsymbol\beta}$ rather than of $\hat{\boldsymbol\beta}$.

Also note that the $n$ rows of $\mathbf{X}$ generate a $k$-dimensional subspace of $p$-dimensional space. The matrix products of the row vectors in this space with $\boldsymbol\beta$ constitute the set of all possible estimable functions. The matrix products of the row vectors in this space with any $\hat{\boldsymbol\beta}$ (these products are invariant to the choice of a generalized inverse) constitute the unambiguous set of corresponding estimates of these functions.

Finally, $\hat{\boldsymbol\varepsilon} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol\beta} = (\mathbf{I} - \mathbf{H})\mathbf{y}$ can be taken as an unambiguous predictor of $\boldsymbol\varepsilon$. Since $\hat{\boldsymbol\varepsilon}$ is now a vector in $(n - k)$-dimensional space, it seems reasonable to estimate $\sigma^2$ as the squared length (2.22) of $\hat{\boldsymbol\varepsilon}$ divided by $n - k$. In other words, a sensible estimator of $\sigma^2$ is $s^2 = \mathbf{y}'(\mathbf{I} - \mathbf{H})\mathbf{y}/(n - k)$, which is equal to (12.22).


12.5 REPARAMETERIZATION 


Reparameterization was defined and illustrated in Section 12.1.1. We now formalize 
and extend this approach to obtaining a model based on estimable parameters. 

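In reparameterization, we transform the non-full-rank model $\mathbf{y} = \mathbf{X}\boldsymbol\beta + \boldsymbol\varepsilon$, where $\mathbf{X}$ is $n \times p$ of rank $k < p \le n$, to the full-rank model $\mathbf{y} = \mathbf{Z}\boldsymbol\gamma + \boldsymbol\varepsilon$, where $\mathbf{Z}$ is $n \times k$ of rank $k$ and $\boldsymbol\gamma = \mathbf{U}\boldsymbol\beta$ is a set of $k$ linearly independent estimable functions of $\boldsymbol\beta$. Thus $\mathbf{Z}\boldsymbol\gamma = \mathbf{X}\boldsymbol\beta$, and we can write

$$\mathbf{Z}\boldsymbol\gamma = \mathbf{Z}\mathbf{U}\boldsymbol\beta = \mathbf{X}\boldsymbol\beta, \tag{12.30}$$

where $\mathbf{X} = \mathbf{Z}\mathbf{U}$. Since $\mathbf{U}$ is $k \times p$ of rank $k < p$, the matrix $\mathbf{U}\mathbf{U}'$ is nonsingular by Theorem 2.4(iii), and we can multiply $\mathbf{Z}\mathbf{U} = \mathbf{X}$ by $\mathbf{U}'$ to solve for $\mathbf{Z}$ in terms of $\mathbf{X}$ and $\mathbf{U}$:

$$\begin{aligned}
\mathbf{Z}\mathbf{U}\mathbf{U}' &= \mathbf{X}\mathbf{U}' \\
\mathbf{Z} &= \mathbf{X}\mathbf{U}'(\mathbf{U}\mathbf{U}')^{-1}.
\end{aligned} \tag{12.31}$$

To establish that $\mathbf{Z}$ is full-rank, note that $\operatorname{rank}(\mathbf{Z}) \ge \operatorname{rank}(\mathbf{Z}\mathbf{U}) = \operatorname{rank}(\mathbf{X}) = k$ by Theorem 2.4(i). However, $\mathbf{Z}$ cannot have rank greater than $k$ since $\mathbf{Z}$ has $k$ columns. Thus $\operatorname{rank}(\mathbf{Z}) = k$, and the model $\mathbf{y} = \mathbf{Z}\boldsymbol\gamma + \boldsymbol\varepsilon$ is a full-rank model. We can therefore use the theorems of Chapters 7 and 8; for example, the normal equations $\mathbf{Z}'\mathbf{Z}\hat{\boldsymbol\gamma} = \mathbf{Z}'\mathbf{y}$ have the unique solution $\hat{\boldsymbol\gamma} = (\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{y}$.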

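In the reparameterized full-rank model $\mathbf{y} = \mathbf{Z}\boldsymbol\gamma + \boldsymbol\varepsilon$, the unbiased estimator of $\sigma^2$ is given by

$$s^2 = \frac{1}{n - k}(\mathbf{y} - \mathbf{Z}\hat{\boldsymbol\gamma})'(\mathbf{y} - \mathbf{Z}\hat{\boldsymbol\gamma}). \tag{12.32}$$

Since $\mathbf{Z}\boldsymbol\gamma = \mathbf{X}\boldsymbol\beta$, the estimators $\mathbf{Z}\hat{\boldsymbol\gamma}$ and $\mathbf{X}\hat{\boldsymbol\beta}$ are also equal,

$$\mathbf{Z}\hat{\boldsymbol\gamma} = \mathbf{X}\hat{\boldsymbol\beta},$$

and SSE in (12.19) and SSE in (12.32) are the same:

$$(\mathbf{y} - \mathbf{X}\hat{\boldsymbol\beta})'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol\beta}) = (\mathbf{y} - \mathbf{Z}\hat{\boldsymbol\gamma})'(\mathbf{y} - \mathbf{Z}\hat{\boldsymbol\gamma}). \tag{12.33}$$

The set $\mathbf{U}\boldsymbol\beta = \boldsymbol\gamma$ is only one possible set of linearly independent estimable functions. Let $\mathbf{V}\boldsymbol\beta = \boldsymbol\delta$ be another set of linearly independent estimable functions. Then there exists a matrix $\mathbf{W}$ such that $\mathbf{y} = \mathbf{W}\boldsymbol\delta + \boldsymbol\varepsilon$. Now an estimable function $\boldsymbol\lambda'\boldsymbol\beta$ can be expressed as a function of $\boldsymbol\gamma$ or of $\boldsymbol\delta$:

$$\boldsymbol\lambda'\boldsymbol\beta = \mathbf{b}'\boldsymbol\gamma = \mathbf{c}'\boldsymbol\delta. \tag{12.34}$$

Hence

$$\boldsymbol\lambda'\hat{\boldsymbol\beta} = \mathbf{b}'\hat{\boldsymbol\gamma} = \mathbf{c}'\hat{\boldsymbol\delta},$$

and either reparameterization gives the same estimator of $\boldsymbol\lambda'\boldsymbol\beta$.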


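Example 12.5. We illustrate a reparameterization for the model $y_{ij} = \mu + \tau_i + \varepsilon_{ij}$, $i = 1, 2$, $j = 1, 2$. In matrix form, the model can be written as

$$\mathbf{y} = \mathbf{X}\boldsymbol\beta + \boldsymbol\varepsilon =
\begin{pmatrix} 1&1&0\\ 1&1&0\\ 1&0&1\\ 1&0&1 \end{pmatrix}
\begin{pmatrix} \mu\\ \tau_1\\ \tau_2 \end{pmatrix}
+ \begin{pmatrix} \varepsilon_{11}\\ \varepsilon_{12}\\ \varepsilon_{21}\\ \varepsilon_{22} \end{pmatrix}.$$

Since $\mathbf{X}$ has rank 2, there exist two linearly independent estimable functions (see Theorem 12.2c). We can choose these in many ways, one of which is $\mu + \tau_1$ and $\mu + \tau_2$. Thus

$$\boldsymbol\gamma = \begin{pmatrix} \gamma_1\\ \gamma_2 \end{pmatrix}
= \begin{pmatrix} \mu + \tau_1\\ \mu + \tau_2 \end{pmatrix}
= \begin{pmatrix} 1&1&0\\ 1&0&1 \end{pmatrix}
\begin{pmatrix} \mu\\ \tau_1\\ \tau_2 \end{pmatrix} = \mathbf{U}\boldsymbol\beta.$$

To reparameterize in terms of $\boldsymbol\gamma$, we can use

$$\mathbf{Z} = \begin{pmatrix} 1&0\\ 1&0\\ 0&1\\ 0&1 \end{pmatrix},$$

so that $\mathbf{Z}\boldsymbol\gamma = \mathbf{X}\boldsymbol\beta$:

$$\mathbf{Z}\boldsymbol\gamma = \begin{pmatrix} 1&0\\ 1&0\\ 0&1\\ 0&1 \end{pmatrix}
\begin{pmatrix} \gamma_1\\ \gamma_2 \end{pmatrix}
= \begin{pmatrix} \gamma_1\\ \gamma_1\\ \gamma_2\\ \gamma_2 \end{pmatrix}
= \begin{pmatrix} \mu + \tau_1\\ \mu + \tau_1\\ \mu + \tau_2\\ \mu + \tau_2 \end{pmatrix}.$$

[The matrix $\mathbf{Z}$ can also be obtained directly using (12.31).] It is easy to verify that $\mathbf{Z}\mathbf{U} = \mathbf{X}$:

$$\mathbf{Z}\mathbf{U} = \begin{pmatrix} 1&0\\ 1&0\\ 0&1\\ 0&1 \end{pmatrix}
\begin{pmatrix} 1&1&0\\ 1&0&1 \end{pmatrix}
= \begin{pmatrix} 1&1&0\\ 1&1&0\\ 1&0&1\\ 1&0&1 \end{pmatrix} = \mathbf{X}.$$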
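The bracketed remark in Example 12.5, that $\mathbf{Z}$ follows directly from (12.31), can be checked with a few lines of NumPy. This is a minimal sketch assuming NumPy is available; the variable names simply mirror the symbols in the example.

```python
import numpy as np

# X and U from Example 12.5 (columns of X correspond to mu, tau_1, tau_2).
X = np.array([[1, 1, 0],
              [1, 1, 0],
              [1, 0, 1],
              [1, 0, 1]], dtype=float)
U = np.array([[1, 1, 0],
              [1, 0, 1]], dtype=float)

# Equation (12.31): Z = X U' (U U')^{-1}; U U' is nonsingular because U has full row rank.
Z = X @ U.T @ np.linalg.inv(U @ U.T)

rank_Z = np.linalg.matrix_rank(Z)
```

The computed `Z` matches the indicator matrix written down in the example, `Z @ U` reproduces `X`, and `Z` has full column rank 2.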


12.6 SIDE CONDITIONS 


The technique of imposing side conditions was introduced and illustrated in Section 12.1. Side conditions provide (linear) constraints that make the parameters unique and individually estimable, but side conditions also impose specific definitions on the parameters. Another use for side conditions is to impose arbitrary constraints on the estimates so as to simplify the normal equations. In this case the estimates have exactly the same status as those based on a particular generalized inverse (12.13), and only estimable functions of $\boldsymbol\beta$ can be interpreted.

Let $\mathbf{X}$ be $n \times p$ of rank $k < p \le n$. Then, by Theorem 12.2b(ii), $\mathbf{X}'\mathbf{X}\boldsymbol\beta$ represents a set of $p$ estimable functions of $\boldsymbol\beta$. If a side condition were an estimable function of $\boldsymbol\beta$, it could be expressed as a linear combination of the rows of $\mathbf{X}'\mathbf{X}\boldsymbol\beta$ and would contribute nothing to the rank deficiency in $\mathbf{X}$ or to obtaining a solution vector $\hat{\boldsymbol\beta}$ for $\mathbf{X}'\mathbf{X}\hat{\boldsymbol\beta} = \mathbf{X}'\mathbf{y}$. Therefore, side conditions must be nonestimable functions of $\boldsymbol\beta$.

The matrix $\mathbf{X}$ is $n \times p$ of rank $k < p$. Hence the deficiency in the rank of $\mathbf{X}$ is $p - k$. In order for all the parameters to be unique or to obtain a unique solution vector $\hat{\boldsymbol\beta}$, we must define side conditions that make up this deficiency in rank. Accordingly, we define side conditions $\mathbf{T}\boldsymbol\beta = \mathbf{0}$ or $\mathbf{T}\hat{\boldsymbol\beta} = \mathbf{0}$, where $\mathbf{T}$ is a $(p - k) \times p$ matrix of rank $p - k$ such that $\mathbf{T}\boldsymbol\beta$ is a set of nonestimable functions.

In the following theorem, we consider a solution vector $\hat{\boldsymbol\beta}$ for both $\mathbf{X}'\mathbf{X}\hat{\boldsymbol\beta} = \mathbf{X}'\mathbf{y}$ and $\mathbf{T}\hat{\boldsymbol\beta} = \mathbf{0}$.


Theorem 12.6a. If $\mathbf{y} = \mathbf{X}\boldsymbol\beta + \boldsymbol\varepsilon$, where $\mathbf{X}$ is $n \times p$ of rank $k < p \le n$, and if $\mathbf{T}$ is a $(p - k) \times p$ matrix of rank $p - k$ such that $\mathbf{T}\boldsymbol\beta$ is a set of nonestimable functions, then there is a unique vector $\hat{\boldsymbol\beta}$ that satisfies both $\mathbf{X}'\mathbf{X}\hat{\boldsymbol\beta} = \mathbf{X}'\mathbf{y}$ and $\mathbf{T}\hat{\boldsymbol\beta} = \mathbf{0}$.




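PROOF. The two sets of equations

$$\begin{aligned}
\mathbf{y} &= \mathbf{X}\boldsymbol\beta + \boldsymbol\varepsilon \\
\mathbf{0} &= \mathbf{T}\boldsymbol\beta + \mathbf{0}
\end{aligned}$$

can be combined into

$$\begin{pmatrix} \mathbf{y}\\ \mathbf{0} \end{pmatrix} = \begin{pmatrix} \mathbf{X}\\ \mathbf{T} \end{pmatrix}\boldsymbol\beta + \begin{pmatrix} \boldsymbol\varepsilon\\ \mathbf{0} \end{pmatrix}. \tag{12.35}$$

Since the rows of $\mathbf{T}$ are linearly independent and are not functions of the rows of $\mathbf{X}$, the matrix $\begin{pmatrix} \mathbf{X}\\ \mathbf{T} \end{pmatrix}$ is $(n + p - k) \times p$ of rank $p$. Thus $\begin{pmatrix} \mathbf{X}\\ \mathbf{T} \end{pmatrix}'\begin{pmatrix} \mathbf{X}\\ \mathbf{T} \end{pmatrix}$ is $p \times p$ of rank $p$, and the system of equations

$$\begin{pmatrix} \mathbf{X}\\ \mathbf{T} \end{pmatrix}'\begin{pmatrix} \mathbf{X}\\ \mathbf{T} \end{pmatrix}\hat{\boldsymbol\beta} = \begin{pmatrix} \mathbf{X}\\ \mathbf{T} \end{pmatrix}'\begin{pmatrix} \mathbf{y}\\ \mathbf{0} \end{pmatrix} \tag{12.36}$$

has the unique solution

$$\begin{aligned}
\hat{\boldsymbol\beta} &= \left[\begin{pmatrix} \mathbf{X}\\ \mathbf{T} \end{pmatrix}'\begin{pmatrix} \mathbf{X}\\ \mathbf{T} \end{pmatrix}\right]^{-1}\begin{pmatrix} \mathbf{X}\\ \mathbf{T} \end{pmatrix}'\begin{pmatrix} \mathbf{y}\\ \mathbf{0} \end{pmatrix} \\
&= \left[(\mathbf{X}', \mathbf{T}')\begin{pmatrix} \mathbf{X}\\ \mathbf{T} \end{pmatrix}\right]^{-1}(\mathbf{X}', \mathbf{T}')\begin{pmatrix} \mathbf{y}\\ \mathbf{0} \end{pmatrix} \\
&= (\mathbf{X}'\mathbf{X} + \mathbf{T}'\mathbf{T})^{-1}(\mathbf{X}'\mathbf{y} + \mathbf{T}'\mathbf{0}) \\
&= (\mathbf{X}'\mathbf{X} + \mathbf{T}'\mathbf{T})^{-1}\mathbf{X}'\mathbf{y}.
\end{aligned} \tag{12.37}$$

This approach to imposing constraints on the parameters does not work for full-rank models [see (8.30) and Problem 8.19] or for overparameterized models if the constraints involve estimable functions. However, if $\mathbf{T}\boldsymbol\beta$ is a set of nonestimable functions, the least-squares criterion guarantees that $\mathbf{T}\hat{\boldsymbol\beta} = \mathbf{0}$. The solution $\hat{\boldsymbol\beta}$ in (12.37) also satisfies the original normal equations $\mathbf{X}'\mathbf{X}\hat{\boldsymbol\beta} = \mathbf{X}'\mathbf{y}$, since, by (12.36),

$$\begin{aligned}
(\mathbf{X}'\mathbf{X} + \mathbf{T}'\mathbf{T})\hat{\boldsymbol\beta} &= \mathbf{X}'\mathbf{y} + \mathbf{T}'\mathbf{0} \\
\mathbf{X}'\mathbf{X}\hat{\boldsymbol\beta} + \mathbf{T}'\mathbf{T}\hat{\boldsymbol\beta} &= \mathbf{X}'\mathbf{y}.
\end{aligned} \tag{12.38}$$

But $\mathbf{T}\hat{\boldsymbol\beta} = \mathbf{0}$, and (12.38) reduces to $\mathbf{X}'\mathbf{X}\hat{\boldsymbol\beta} = \mathbf{X}'\mathbf{y}$.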




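Example 12.6. Consider the model $y_{ij} = \mu + \tau_i + \varepsilon_{ij}$, $i = 1, 2$, $j = 1, 2$, as in Example 12.5. The function $\tau_1 + \tau_2$ was shown to be nonestimable in Problem 12.5b. The side condition $\tau_1 + \tau_2 = 0$ can be expressed as $(0, 1, 1)\boldsymbol\beta = 0$, and $\mathbf{X}'\mathbf{X} + \mathbf{T}'\mathbf{T}$ becomes

$$\begin{pmatrix} 4&2&2\\ 2&2&0\\ 2&0&2 \end{pmatrix} + \begin{pmatrix} 0\\ 1\\ 1 \end{pmatrix}(0\ \ 1\ \ 1) = \begin{pmatrix} 4&2&2\\ 2&3&1\\ 2&1&3 \end{pmatrix}.$$

Then

$$(\mathbf{X}'\mathbf{X} + \mathbf{T}'\mathbf{T})^{-1} = \frac14\begin{pmatrix} 2&-1&-1\\ -1&2&0\\ -1&0&2 \end{pmatrix}.$$

With $\mathbf{X}'\mathbf{y} = (y_{..}, y_{1.}, y_{2.})'$, we obtain, by (12.37),

$$\hat{\boldsymbol\beta} = (\mathbf{X}'\mathbf{X} + \mathbf{T}'\mathbf{T})^{-1}\mathbf{X}'\mathbf{y}
= \frac14\begin{pmatrix} 2y_{..} - y_{1.} - y_{2.}\\ 2y_{1.} - y_{..}\\ 2y_{2.} - y_{..} \end{pmatrix}
= \begin{pmatrix} \bar{y}_{..}\\ \bar{y}_{1.} - \bar{y}_{..}\\ \bar{y}_{2.} - \bar{y}_{..} \end{pmatrix}, \tag{12.39}$$

since $y_{1.} + y_{2.} = y_{..}$.

We now show that $\hat{\boldsymbol\beta}$ in (12.39) is also a solution to the normal equations $\mathbf{X}'\mathbf{X}\hat{\boldsymbol\beta} = \mathbf{X}'\mathbf{y}$:

$$\begin{pmatrix} 4&2&2\\ 2&2&0\\ 2&0&2 \end{pmatrix}
\begin{pmatrix} \bar{y}_{..}\\ \bar{y}_{1.} - \bar{y}_{..}\\ \bar{y}_{2.} - \bar{y}_{..} \end{pmatrix}
= \begin{pmatrix} y_{..}\\ y_{1.}\\ y_{2.} \end{pmatrix}, \quad\text{or}$$

$$\begin{aligned}
4\bar{y}_{..} + 2(\bar{y}_{1.} - \bar{y}_{..}) + 2(\bar{y}_{2.} - \bar{y}_{..}) &= y_{..} \\
2\bar{y}_{..} + 2(\bar{y}_{1.} - \bar{y}_{..}) &= y_{1.} \\
2\bar{y}_{..} + 2(\bar{y}_{2.} - \bar{y}_{..}) &= y_{2.}
\end{aligned}$$

These simplify to

$$\begin{aligned}
2\bar{y}_{1.} + 2\bar{y}_{2.} &= y_{..} \\
2\bar{y}_{1.} &= y_{1.} \\
2\bar{y}_{2.} &= y_{2.},
\end{aligned}$$

which hold because $\bar{y}_{1.} = y_{1.}/2$, $\bar{y}_{2.} = y_{2.}/2$, and $y_{1.} + y_{2.} = y_{..}$.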
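The computation in Example 12.6 can be reproduced numerically. The sketch below assumes NumPy is available and uses hypothetical data values; it solves (12.37) and then checks that the result satisfies both the side condition and the original normal equations.

```python
import numpy as np

# Design matrix for y_ij = mu + tau_i + e_ij, i = 1, 2, j = 1, 2.
X = np.array([[1, 1, 0],
              [1, 1, 0],
              [1, 0, 1],
              [1, 0, 1]], dtype=float)
T = np.array([[0.0, 1.0, 1.0]])       # side condition tau_1 + tau_2 = 0
y = np.array([3.0, 5.0, 8.0, 10.0])   # hypothetical observations

# Equation (12.37): beta_hat = (X'X + T'T)^{-1} X'y.
A = X.T @ X + T.T @ T
beta_hat = np.linalg.solve(A, X.T @ y)

# Closed form (12.39): (ybar.., ybar1. - ybar.., ybar2. - ybar..)'.
ybar = y.mean()
closed_form = np.array([ybar, y[:2].mean() - ybar, y[2:].mean() - ybar])
```

Using `np.linalg.solve` avoids forming the inverse explicitly; for these data the solution is $(6.5, -2.5, 2.5)'$.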


12.7 TESTING HYPOTHESES 


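We now consider hypotheses about the $\beta$'s in the model $\mathbf{y} = \mathbf{X}\boldsymbol\beta + \boldsymbol\varepsilon$, where $\mathbf{X}$ is $n \times p$ of rank $k < p \le n$. In this section, we assume that $\mathbf{y}$ is $N_n(\mathbf{X}\boldsymbol\beta, \sigma^2\mathbf{I})$.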


12.7.1 Testable Hypotheses 


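It can be shown that unless a hypothesis can be expressed in terms of estimable functions, it cannot be tested (Searle 1971, pp. 193-196). This leads to the following definition.


A hypothesis such as $H_0\colon \beta_1 = \beta_2 = \cdots = \beta_q$ is said to be testable if there exists a set of linearly independent estimable functions $\boldsymbol\lambda_1'\boldsymbol\beta, \boldsymbol\lambda_2'\boldsymbol\beta, \ldots, \boldsymbol\lambda_t'\boldsymbol\beta$ such that $H_0$ is true if and only if $\boldsymbol\lambda_1'\boldsymbol\beta = \boldsymbol\lambda_2'\boldsymbol\beta = \cdots = \boldsymbol\lambda_t'\boldsymbol\beta = 0$.


Sometimes the subset of $\beta$'s whose equality we wish to test is such that every contrast $\sum_i c_i\beta_i$ is estimable ($\sum_i c_i\beta_i$ is a contrast if $\sum_i c_i = 0$). In this case, it is easy to find a set of $q - 1$ linearly independent estimable functions that can be set equal to zero to express $\beta_1 = \cdots = \beta_q$. One such set is the following:

$$\begin{aligned}
\boldsymbol\lambda_1'\boldsymbol\beta &= (q - 1)\beta_1 - (\beta_2 + \beta_3 + \cdots + \beta_q) \\
\boldsymbol\lambda_2'\boldsymbol\beta &= (q - 2)\beta_2 - (\beta_3 + \cdots + \beta_q) \\
&\ \ \vdots \\
\boldsymbol\lambda_{q-1}'\boldsymbol\beta &= (1)\beta_{q-1} - (\beta_q).
\end{aligned}$$

These $q - 1$ contrasts $\boldsymbol\lambda_1'\boldsymbol\beta, \ldots, \boldsymbol\lambda_{q-1}'\boldsymbol\beta$ constitute a set of linearly independent estimable functions such that

$$\begin{pmatrix} \boldsymbol\lambda_1'\boldsymbol\beta\\ \vdots\\ \boldsymbol\lambda_{q-1}'\boldsymbol\beta \end{pmatrix} = \begin{pmatrix} 0\\ \vdots\\ 0 \end{pmatrix}$$

if and only if $\beta_1 = \beta_2 = \cdots = \beta_q$.

To illustrate a testable hypothesis, suppose that we have the model $y_{ij} = \mu + \alpha_i + \beta_j + \varepsilon_{ij}$, $i = 1, 2, 3$, $j = 1, 2, 3$, and a hypothesis of interest is $H_0\colon \alpha_1 = \alpha_2 = \alpha_3$. By taking linear combinations of the rows of $\mathbf{X}\boldsymbol\beta$, we can obtain the two linearly independent estimable functions $\alpha_1 - \alpha_2$ and $\alpha_1 + \alpha_2 - 2\alpha_3$. The hypothesis $H_0\colon \alpha_1 = \alpha_2 = \alpha_3$ is true if and only if $\alpha_1 - \alpha_2$ and $\alpha_1 + \alpha_2 - 2\alpha_3$ are simultaneously equal to zero (see Problem 12.21). Therefore, $H_0$ is a testable hypothesis and is equivalent to

$$H_0\colon \begin{pmatrix} \alpha_1 - \alpha_2\\ \alpha_1 + \alpha_2 - 2\alpha_3 \end{pmatrix} = \begin{pmatrix} 0\\ 0 \end{pmatrix}. \tag{12.40}$$

We now discuss tests for testable hypotheses. In Section 12.7.2, we describe a procedure that is based on the full-reduced-model methods of Section 8.2. Since (12.40) is of the form $H_0\colon \mathbf{C}\boldsymbol\beta = \mathbf{0}$, we could alternatively use a general linear hypothesis test (see Section 8.4.1). This approach is discussed in Section 12.7.3.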


12.7.2 Full-Reduced-Model Approach 


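Suppose that we are interested in testing $H_0\colon \beta_1 = \beta_2 = \cdots = \beta_q$ in the non-full-rank model $\mathbf{y} = \mathbf{X}\boldsymbol\beta + \boldsymbol\varepsilon$, where $\boldsymbol\beta$ is $p \times 1$ and $\mathbf{X}$ is $n \times p$ of rank $k < p \le n$. If $H_0$ is testable, we can find a set of linearly independent estimable functions $\boldsymbol\lambda_1'\boldsymbol\beta, \boldsymbol\lambda_2'\boldsymbol\beta, \ldots, \boldsymbol\lambda_t'\boldsymbol\beta$ such that $H_0\colon \beta_1 = \beta_2 = \cdots = \beta_q$ is equivalent to

$$H_0\colon \boldsymbol\gamma_1 = \begin{pmatrix} \boldsymbol\lambda_1'\boldsymbol\beta\\ \boldsymbol\lambda_2'\boldsymbol\beta\\ \vdots\\ \boldsymbol\lambda_t'\boldsymbol\beta \end{pmatrix} = \begin{pmatrix} 0\\ 0\\ \vdots\\ 0 \end{pmatrix}.$$

It is also possible to find

$$\boldsymbol\gamma_2 = \begin{pmatrix} \boldsymbol\lambda_{t+1}'\boldsymbol\beta\\ \vdots\\ \boldsymbol\lambda_k'\boldsymbol\beta \end{pmatrix}$$

such that the $k$ functions $\boldsymbol\lambda_1'\boldsymbol\beta, \ldots, \boldsymbol\lambda_t'\boldsymbol\beta, \boldsymbol\lambda_{t+1}'\boldsymbol\beta, \ldots, \boldsymbol\lambda_k'\boldsymbol\beta$ are linearly independent and estimable, where $k = \operatorname{rank}(\mathbf{X})$. Let

$$\boldsymbol\gamma = \begin{pmatrix} \boldsymbol\gamma_1\\ \boldsymbol\gamma_2 \end{pmatrix}.$$

We can now reparameterize (see Section 12.5) from the non-full-rank model $\mathbf{y} = \mathbf{X}\boldsymbol\beta + \boldsymbol\varepsilon$ to the full-rank model

$$\mathbf{y} = \mathbf{Z}\boldsymbol\gamma + \boldsymbol\varepsilon = \mathbf{Z}_1\boldsymbol\gamma_1 + \mathbf{Z}_2\boldsymbol\gamma_2 + \boldsymbol\varepsilon,$$

where $\mathbf{Z} = (\mathbf{Z}_1, \mathbf{Z}_2)$ is partitioned to conform with the number of elements in $\boldsymbol\gamma_1$ and $\boldsymbol\gamma_2$.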




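For the hypothesis $H_0\colon \boldsymbol\gamma_1 = \mathbf{0}$, the reduced model is $\mathbf{y} = \mathbf{Z}_2\boldsymbol\gamma_2^* + \boldsymbol\varepsilon^*$. By Theorem 7.10, the estimate of $\boldsymbol\gamma_2^*$ in the reduced model is the same as the estimate of $\boldsymbol\gamma_2$ in the full model if the columns of $\mathbf{Z}_2$ are orthogonal to those of $\mathbf{Z}_1$, that is, if $\mathbf{Z}_2'\mathbf{Z}_1 = \mathbf{O}$. For the balanced models we are considering in this chapter, the orthogonality will typically hold (see Section 12.8.3). Accordingly, we refer to $\boldsymbol\gamma_2$ and $\hat{\boldsymbol\gamma}_2$ rather than to $\boldsymbol\gamma_2^*$ and $\hat{\boldsymbol\gamma}_2^*$.

Since $\mathbf{y} = \mathbf{Z}\boldsymbol\gamma + \boldsymbol\varepsilon$ is a full-rank model, the hypothesis $H_0\colon \boldsymbol\gamma_1 = \mathbf{0}$ can be tested as in Section 8.2. The test is outlined in Table 12.2, which is analogous to Table 8.3. Note that the degrees of freedom $t$ for $SS(\boldsymbol\gamma_1|\boldsymbol\gamma_2)$ is the number of linearly independent estimable functions required to express $H_0$.

In Table 12.2, the sum of squares $\hat{\boldsymbol\gamma}'\mathbf{Z}'\mathbf{y}$ is obtained from the full model $\mathbf{y} = \mathbf{Z}\boldsymbol\gamma + \boldsymbol\varepsilon$. The sum of squares $\hat{\boldsymbol\gamma}_2'\mathbf{Z}_2'\mathbf{y}$ is obtained from the reduced model $\mathbf{y} = \mathbf{Z}_2\boldsymbol\gamma_2 + \boldsymbol\varepsilon^*$, which assumes the hypothesis is true.

The reparameterization procedure presented above seems straightforward. However, finding the matrix $\mathbf{Z}$ in practice can be time-consuming. Fortunately, this step is actually not necessary.

From (12.20) and (12.33), we obtain

$$\mathbf{y}'\mathbf{y} - \hat{\boldsymbol\beta}'\mathbf{X}'\mathbf{y} = \mathbf{y}'\mathbf{y} - \hat{\boldsymbol\gamma}'\mathbf{Z}'\mathbf{y},$$

which gives

$$\hat{\boldsymbol\beta}'\mathbf{X}'\mathbf{y} = \hat{\boldsymbol\gamma}'\mathbf{Z}'\mathbf{y}, \tag{12.41}$$

where $\hat{\boldsymbol\beta}$ represents any solution to the normal equations $\mathbf{X}'\mathbf{X}\hat{\boldsymbol\beta} = \mathbf{X}'\mathbf{y}$. Similarly, corresponding to $\mathbf{y} = \mathbf{Z}_2\boldsymbol\gamma_2^* + \boldsymbol\varepsilon^*$, we have a reduced model $\mathbf{y} = \mathbf{X}_2\boldsymbol\beta_2^* + \boldsymbol\varepsilon^*$ obtained by setting $\beta_1 = \beta_2 = \cdots = \beta_q$. Then

$$\hat{\boldsymbol\beta}_2^{*\prime}\mathbf{X}_2'\mathbf{y} = \hat{\boldsymbol\gamma}_2^{*\prime}\mathbf{Z}_2'\mathbf{y}, \tag{12.42}$$

where $\hat{\boldsymbol\beta}_2^*$ is any solution to the reduced normal equations $\mathbf{X}_2'\mathbf{X}_2\hat{\boldsymbol\beta}_2^* = \mathbf{X}_2'\mathbf{y}$. We can often use side conditions to find $\hat{\boldsymbol\beta}$ and $\hat{\boldsymbol\beta}_2^*$.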
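We noted above (see also Section 12.8.3) that if $\mathbf{Z}_2'\mathbf{Z}_1 = \mathbf{O}$ holds in a reparameterized full-rank model, then by Theorem 7.10, the estimate of $\boldsymbol\gamma_2^*$ in the reduced model is the same as the estimate of $\boldsymbol\gamma_2$ in the full model. The following is an analogous theorem for the non-full-rank case.


TABLE 12.2 ANOVA for Testing $H_0\colon \boldsymbol\gamma_1 = \mathbf{0}$ in Reparameterized Balanced Models

Source of Variation                                             df        Sum of Squares                                                                                                                          F Statistic
Due to $\boldsymbol\gamma_1$ adjusted for $\boldsymbol\gamma_2$   $t$       $SS(\boldsymbol\gamma_1|\boldsymbol\gamma_2) = \hat{\boldsymbol\gamma}'\mathbf{Z}'\mathbf{y} - \hat{\boldsymbol\gamma}_2'\mathbf{Z}_2'\mathbf{y}$   $\dfrac{SS(\boldsymbol\gamma_1|\boldsymbol\gamma_2)/t}{\mathrm{SSE}/(n - k)}$
Error                                                           $n - k$   $\mathrm{SSE} = \mathbf{y}'\mathbf{y} - \hat{\boldsymbol\gamma}'\mathbf{Z}'\mathbf{y}$
Total                                                           $n - 1$   $\mathrm{SST} = \mathbf{y}'\mathbf{y} - n\bar{y}^2$


TABLE 12.3 ANOVA for Testing $H_0\colon \beta_1 = \beta_2 = \cdots = \beta_q$ in Balanced Non-Full-Rank Models

Source of Variation                                             df        Sum of Squares                                                                                                                        F Statistic
Due to $\boldsymbol\beta_1$ adjusted for $\boldsymbol\beta_2$     $t$       $SS(\boldsymbol\beta_1|\boldsymbol\beta_2) = \hat{\boldsymbol\beta}'\mathbf{X}'\mathbf{y} - \hat{\boldsymbol\beta}_2'\mathbf{X}_2'\mathbf{y}$   $\dfrac{SS(\boldsymbol\beta_1|\boldsymbol\beta_2)/t}{\mathrm{SSE}/(n - k)}$
Error                                                           $n - k$   $\mathrm{SSE} = \mathbf{y}'\mathbf{y} - \hat{\boldsymbol\beta}'\mathbf{X}'\mathbf{y}$
Total                                                           $n - 1$   $\mathrm{SST} = \mathbf{y}'\mathbf{y} - n\bar{y}^2$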


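Theorem 12.7a. Consider the partitioned model $\mathbf{y} = \mathbf{X}\boldsymbol\beta + \boldsymbol\varepsilon = \mathbf{X}_1\boldsymbol\beta_1 + \mathbf{X}_2\boldsymbol\beta_2 + \boldsymbol\varepsilon$, where $\mathbf{X}$ is $n \times p$ of rank $k < p \le n$. If $\mathbf{X}_2'\mathbf{X}_1 = \mathbf{O}$ (see Section 12.8.3), any estimate of $\boldsymbol\beta_2^*$ in the reduced model $\mathbf{y} = \mathbf{X}_2\boldsymbol\beta_2^* + \boldsymbol\varepsilon^*$ is also an estimate of $\boldsymbol\beta_2$ in the full model.


PROOF. There is a generalized inverse of

$$\mathbf{X}'\mathbf{X} = \begin{pmatrix} \mathbf{X}_1'\mathbf{X}_1 & \mathbf{X}_1'\mathbf{X}_2\\ \mathbf{X}_2'\mathbf{X}_1 & \mathbf{X}_2'\mathbf{X}_2 \end{pmatrix}$$

analogous to the inverse of a nonsingular symmetric partitioned matrix in (2.50) (Harville 1997, pp. 121-122). The proof then parallels that of Theorem 7.10.


In the balanced non-full-rank models we are considering in this chapter, the orthogonality of $\mathbf{X}_1$ and $\mathbf{X}_2$ will typically hold. (This will be illustrated in Section 12.8.3.) Accordingly, we refer to $\boldsymbol\beta_2$ and $\hat{\boldsymbol\beta}_2$ rather than to $\boldsymbol\beta_2^*$ and $\hat{\boldsymbol\beta}_2^*$.

The test can be expressed as in Table 12.3, in which $\hat{\boldsymbol\beta}'\mathbf{X}'\mathbf{y}$ is obtained from the full model $\mathbf{y} = \mathbf{X}\boldsymbol\beta + \boldsymbol\varepsilon$ and $\hat{\boldsymbol\beta}_2'\mathbf{X}_2'\mathbf{y}$ is obtained from the model $\mathbf{y} = \mathbf{X}_2\boldsymbol\beta_2 + \boldsymbol\varepsilon^*$, which has been reduced by the hypothesis $H_0\colon \beta_1 = \beta_2 = \cdots = \beta_q$. Note that the degrees of freedom $t$ for $SS(\boldsymbol\beta_1|\boldsymbol\beta_2)$ is the same as for $SS(\boldsymbol\gamma_1|\boldsymbol\gamma_2)$ in Table 12.2, namely, the number of linearly independent estimable functions required to express $H_0$. Typically, this is given by $t = q - 1$. A set of $q - 1$ linearly independent estimable functions was illustrated at the beginning of Section 12.7.1. The test in Table 12.3 will be illustrated in Section 12.8.2.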


12.7.3 General Linear Hypothesis 


As illustrated in (12.40), a hypothesis such as $H_0\colon \alpha_1 = \alpha_2 = \alpha_3$ can be expressed in the form $H_0\colon \mathbf{C}\boldsymbol\beta = \mathbf{0}$. We can test this hypothesis in a manner analogous to that used for the general linear hypothesis test for the full-rank model in Section 8.4.1. The following theorem is an extension of Theorem 8.4a to the non-full-rank case.




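Theorem 12.7b. If $\mathbf{y}$ is $N_n(\mathbf{X}\boldsymbol\beta, \sigma^2\mathbf{I})$, where $\mathbf{X}$ is $n \times p$ of rank $k < p \le n$, if $\mathbf{C}$ is $m \times p$ of rank $m \le k$ such that $\mathbf{C}\boldsymbol\beta$ is a set of $m$ linearly independent estimable functions, and if $\hat{\boldsymbol\beta} = (\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{y}$, then

(i) $\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{C}'$ is nonsingular.
(ii) $\mathbf{C}\hat{\boldsymbol\beta}$ is $N_m[\mathbf{C}\boldsymbol\beta,\ \sigma^2\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{C}']$.
(iii) $\mathrm{SSH}/\sigma^2 = (\mathbf{C}\hat{\boldsymbol\beta})'[\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{C}']^{-1}\mathbf{C}\hat{\boldsymbol\beta}/\sigma^2$ is $\chi^2(m, \lambda)$, where $\lambda = (\mathbf{C}\boldsymbol\beta)'[\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{C}']^{-1}\mathbf{C}\boldsymbol\beta/2\sigma^2$.
(iv) $\mathrm{SSE}/\sigma^2 = \mathbf{y}'[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}']\mathbf{y}/\sigma^2$ is $\chi^2(n - k)$.
(v) SSH and SSE are independent.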


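PROOF

(i) Since

$$\mathbf{C}\boldsymbol\beta = \begin{pmatrix} \mathbf{c}_1'\boldsymbol\beta\\ \mathbf{c}_2'\boldsymbol\beta\\ \vdots\\ \mathbf{c}_m'\boldsymbol\beta \end{pmatrix}$$

is a set of $m$ linearly independent estimable functions, then by Theorem 12.2b(iii) we have $\mathbf{c}_i'(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{X} = \mathbf{c}_i'$ for $i = 1, 2, \ldots, m$. Hence

$$\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{X} = \mathbf{C}. \tag{12.43}$$

Writing (12.43) as the product

$$[\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}']\mathbf{X} = \mathbf{C},$$

we can use Theorem 2.4(i) to obtain the inequalities

$$\operatorname{rank}(\mathbf{C}) \le \operatorname{rank}[\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'] \le \operatorname{rank}(\mathbf{C}).$$

Hence $\operatorname{rank}[\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'] = \operatorname{rank}(\mathbf{C}) = m$. Now, by Theorem 2.4(iii), which states that $\operatorname{rank}(\mathbf{A}) = \operatorname{rank}(\mathbf{A}\mathbf{A}')$, we can write

$$\begin{aligned}
\operatorname{rank}(\mathbf{C}) &= \operatorname{rank}[\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'] \\
&= \operatorname{rank}\{[\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'][\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}']'\} \\
&= \operatorname{rank}[\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{C}'].
\end{aligned}$$

By (12.43), $\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{X} = \mathbf{C}$, and we have

$$\operatorname{rank}(\mathbf{C}) = \operatorname{rank}[\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{C}'].$$

Thus the $m \times m$ matrix $\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{C}'$ is nonsingular. [Note that we are assuming that $(\mathbf{X}'\mathbf{X})^{-}$ is symmetric. See Problem 2.46 and a comment following Theorem 2.8c(v).]

(ii) By (3.38) and (12.14), we obtain

$$E(\mathbf{C}\hat{\boldsymbol\beta}) = \mathbf{C}E(\hat{\boldsymbol\beta}) = \mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{X}\boldsymbol\beta.$$

By (12.43), $\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{X} = \mathbf{C}$, and therefore

$$E(\mathbf{C}\hat{\boldsymbol\beta}) = \mathbf{C}\boldsymbol\beta. \tag{12.44}$$

By (3.44) and (12.18), we have

$$\operatorname{cov}(\mathbf{C}\hat{\boldsymbol\beta}) = \mathbf{C}\operatorname{cov}(\hat{\boldsymbol\beta})\mathbf{C}' = \sigma^2\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{C}'.$$

By (12.43), this becomes

$$\operatorname{cov}(\mathbf{C}\hat{\boldsymbol\beta}) = \sigma^2\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{C}'. \tag{12.45}$$

By Theorem 12.3g(i), $\hat{\boldsymbol\beta}$ is $N_p[(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{X}\boldsymbol\beta,\ \sigma^2(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}]$ for a particular $(\mathbf{X}'\mathbf{X})^{-}$. Then by (12.44), (12.45), and Theorem 4.4a(ii), we obtain

$$\mathbf{C}\hat{\boldsymbol\beta} \text{ is } N_m[\mathbf{C}\boldsymbol\beta,\ \sigma^2\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{C}'].$$

(iii) By part (ii), $\operatorname{cov}(\mathbf{C}\hat{\boldsymbol\beta}) = \sigma^2\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{C}'$. Since $\sigma^2[\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{C}']^{-1}\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{C}'/\sigma^2 = \mathbf{I}$, the result follows by Theorem 5.5.

(iv) This was established in Theorem 12.3g(ii).

(v) By Theorem 12.3g(iii), $\hat{\boldsymbol\beta}$ and SSE are independent. Hence $\mathrm{SSH} = (\mathbf{C}\hat{\boldsymbol\beta})'[\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{C}']^{-1}\mathbf{C}\hat{\boldsymbol\beta}$ and SSE are independent [see Seber (1977, pp. 17-18) for a proof that continuous functions of independent random variables and vectors are independent]. For a more formal proof, see Problem 12.22.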


Using the results in Theorem 12.7b, we obtain an $F$ test for $H_0\colon \mathbf{C}\boldsymbol\beta = \mathbf{0}$, as given in the following theorem, which is analogous to Theorem 8.4b.




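Theorem 12.7c. Let $\mathbf{y}$ be $N_n(\mathbf{X}\boldsymbol\beta, \sigma^2\mathbf{I})$, where $\mathbf{X}$ is $n \times p$ of rank $k < p \le n$, and let $\mathbf{C}$, $\mathbf{C}\boldsymbol\beta$, and $\hat{\boldsymbol\beta}$ be defined as in Theorem 12.7b. Then, if $H_0\colon \mathbf{C}\boldsymbol\beta = \mathbf{0}$ is true, the statistic

$$F = \frac{\mathrm{SSH}/m}{\mathrm{SSE}/(n - k)} = \frac{(\mathbf{C}\hat{\boldsymbol\beta})'[\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{C}']^{-1}\mathbf{C}\hat{\boldsymbol\beta}/m}{\mathrm{SSE}/(n - k)}$$

is distributed as $F(m, n - k)$.


PROOF. This follows from (5.28) and Theorem 12.7b.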
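The statistic in Theorem 12.7c can be computed directly from a generalized inverse. The following is a sketch under stated assumptions, not the authors' code: it uses NumPy, made-up observations, the additive two-way layout with three levels of $\alpha$ and two levels of $\beta$, and the Moore-Penrose inverse as one convenient symmetric generalized inverse.

```python
import numpy as np

# Additive model y_ij = mu + alpha_i + beta_j + e_ij, i = 1, 2, 3, j = 1, 2.
# Columns: mu, alpha_1, alpha_2, alpha_3, beta_1, beta_2.
X = np.array([[1, 1, 0, 0, 1, 0],
              [1, 1, 0, 0, 0, 1],
              [1, 0, 1, 0, 1, 0],
              [1, 0, 1, 0, 0, 1],
              [1, 0, 0, 1, 1, 0],
              [1, 0, 0, 1, 0, 1]], dtype=float)
y = np.array([4.0, 6.0, 7.0, 9.0, 1.0, 5.0])  # hypothetical observations

# H0: alpha_1 - alpha_2 = 0 and alpha_1 - alpha_3 = 0, written as C beta = 0.
C = np.array([[0, 1, -1, 0, 0, 0],
              [0, 1, 0, -1, 0, 0]], dtype=float)

n = X.shape[0]
k = np.linalg.matrix_rank(X)
m = C.shape[0]

G = np.linalg.pinv(X.T @ X)   # a symmetric generalized inverse of X'X
beta_hat = G @ X.T @ y        # one solution of the normal equations

Cb = C @ beta_hat
SSH = float(Cb @ np.linalg.solve(C @ G @ C.T, Cb))
SSE = float(y @ y - beta_hat @ X.T @ y)
F_stat = (SSH / m) / (SSE / (n - k))
```

For these illustrative data the hypothesis sum of squares is $76/3$ on 2 degrees of freedom and the error sum of squares is $4/3$ on $n - k = 2$ degrees of freedom; a p-value would require an $F$ distribution routine, which is omitted here.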


12.8 AN ILLUSTRATION OF ESTIMATION AND TESTING 


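Suppose we have the additive (no-interaction) model

$$y_{ij} = \mu + \alpha_i + \beta_j + \varepsilon_{ij}, \quad i = 1, 2, 3, \quad j = 1, 2, \tag{12.46}$$

and that the hypotheses of interest are $H_0\colon \alpha_1 = \alpha_2 = \alpha_3$ and $H_0\colon \beta_1 = \beta_2$. The six observations can be written in the form $\mathbf{y} = \mathbf{X}\boldsymbol\beta + \boldsymbol\varepsilon$ as

$$\begin{pmatrix} y_{11}\\ y_{12}\\ y_{21}\\ y_{22}\\ y_{31}\\ y_{32} \end{pmatrix}
= \begin{pmatrix}
1&1&0&0&1&0\\
1&1&0&0&0&1\\
1&0&1&0&1&0\\
1&0&1&0&0&1\\
1&0&0&1&1&0\\
1&0&0&1&0&1
\end{pmatrix}
\begin{pmatrix} \mu\\ \alpha_1\\ \alpha_2\\ \alpha_3\\ \beta_1\\ \beta_2 \end{pmatrix}
+ \begin{pmatrix} \varepsilon_{11}\\ \varepsilon_{12}\\ \varepsilon_{21}\\ \varepsilon_{22}\\ \varepsilon_{31}\\ \varepsilon_{32} \end{pmatrix}. \tag{12.47}$$

The matrix $\mathbf{X}'\mathbf{X}$ is given by

$$\mathbf{X}'\mathbf{X} = \begin{pmatrix}
6&2&2&2&3&3\\
2&2&0&0&1&1\\
2&0&2&0&1&1\\
2&0&0&2&1&1\\
3&1&1&1&3&0\\
3&1&1&1&0&3
\end{pmatrix}.$$

The rank of both $\mathbf{X}$ and $\mathbf{X}'\mathbf{X}$ is 4.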




12.8.1 Estimable Functions 


The hypothesis $H_0\colon \alpha_1 = \alpha_2 = \alpha_3$ can be expressed as $H_0\colon \alpha_1 - \alpha_2 = 0$ and $\alpha_1 - \alpha_3 = 0$. Thus $H_0$ is testable if $\alpha_1 - \alpha_2$ and $\alpha_1 - \alpha_3$ are estimable. To check $\alpha_1 - \alpha_2$ for estimability, we write it as

$$\alpha_1 - \alpha_2 = (0, 1, -1, 0, 0, 0)\boldsymbol\beta = \boldsymbol\lambda_1'\boldsymbol\beta$$

and then note that $\boldsymbol\lambda_1'$ can be obtained from $\mathbf{X}$ as

$$(1, 0, -1, 0, 0, 0)\mathbf{X} = (0, 1, -1, 0, 0, 0)$$

and from $\mathbf{X}'\mathbf{X}$ as

$$(0, \tfrac12, -\tfrac12, 0, 0, 0)\mathbf{X}'\mathbf{X} = (0, 1, -1, 0, 0, 0)$$

(see Theorem 12.2b). Alternatively, we can obtain $\alpha_1 - \alpha_2$ as a linear combination of the rows (elements) of $E(\mathbf{y}) = \mathbf{X}\boldsymbol\beta$:

$$\begin{aligned}
E(y_{11} - y_{21}) &= E(y_{11}) - E(y_{21}) \\
&= \mu + \alpha_1 + \beta_1 - (\mu + \alpha_2 + \beta_1) \\
&= \alpha_1 - \alpha_2.
\end{aligned}$$

Similarly, $\alpha_1 - \alpha_3$ can be expressed as

$$\alpha_1 - \alpha_3 = (0, 1, 0, -1, 0, 0)\boldsymbol\beta = \boldsymbol\lambda_2'\boldsymbol\beta,$$

and $\boldsymbol\lambda_2'$ can be obtained from $\mathbf{X}$ or $\mathbf{X}'\mathbf{X}$:

$$\begin{aligned}
(1, 0, 0, 0, -1, 0)\mathbf{X} &= (0, 1, 0, -1, 0, 0), \\
(0, \tfrac12, 0, -\tfrac12, 0, 0)\mathbf{X}'\mathbf{X} &= (0, 1, 0, -1, 0, 0).
\end{aligned}$$

It is also of interest to examine a complete set of linearly independent estimable functions obtained as linear combinations of the rows of $\mathbf{X}$ [see Theorem 12.2b(i) and Example 12.2.2b]. If we subtract the first row from each succeeding row of $\mathbf{X}$, we obtain

$$\begin{pmatrix}
1&1&0&0&1&0\\
0&0&0&0&-1&1\\
0&-1&1&0&0&0\\
0&-1&1&0&-1&1\\
0&-1&0&1&0&0\\
0&-1&0&1&-1&1
\end{pmatrix}.$$

We multiply the second and third rows by $-1$ and then add them to the fourth row, with similar operations involving the second, fifth, and sixth rows. The result is

$$\begin{pmatrix}
1&1&0&0&1&0\\
0&0&0&0&1&-1\\
0&1&-1&0&0&0\\
0&0&0&0&0&0\\
0&1&0&-1&0&0\\
0&0&0&0&0&0
\end{pmatrix}.$$

Multiplying this matrix by $\boldsymbol\beta$, we obtain a complete set of linearly independent estimable functions: $\mu + \alpha_1 + \beta_1$, $\beta_1 - \beta_2$, $\alpha_1 - \alpha_2$, $\alpha_1 - \alpha_3$. Note that the estimable functions not involving $\mu$ are contrasts in the $\alpha$'s or $\beta$'s.
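Estimability of a candidate $\boldsymbol\lambda'\boldsymbol\beta$ can also be checked mechanically: $\boldsymbol\lambda'$ is a linear combination of the rows of $\mathbf{X}$ exactly when appending it to $\mathbf{X}$ does not raise the rank. The sketch below assumes NumPy is available; the function name `is_estimable` is hypothetical and the rank comparison is a standard linear-algebra restatement of Theorem 12.2b(i), not a procedure given in the text.

```python
import numpy as np

# X from (12.47); columns are mu, alpha_1, alpha_2, alpha_3, beta_1, beta_2.
X = np.array([[1, 1, 0, 0, 1, 0],
              [1, 1, 0, 0, 0, 1],
              [1, 0, 1, 0, 1, 0],
              [1, 0, 1, 0, 0, 1],
              [1, 0, 0, 1, 1, 0],
              [1, 0, 0, 1, 0, 1]], dtype=float)


def is_estimable(X, lam):
    """Return True if lam' lies in the row space of X (rank unchanged when lam' is appended)."""
    lam = np.asarray(lam, dtype=float)
    rank_X = np.linalg.matrix_rank(X)
    rank_aug = np.linalg.matrix_rank(np.vstack([X, lam]))
    return bool(rank_aug == rank_X)


alpha1_minus_alpha2 = is_estimable(X, [0, 1, -1, 0, 0, 0])
alpha1_minus_alpha3 = is_estimable(X, [0, 1, 0, -1, 0, 0])
beta1_minus_beta2 = is_estimable(X, [0, 0, 0, 0, 1, -1])
cell_mean = is_estimable(X, [1, 1, 0, 0, 1, 0])     # mu + alpha_1 + beta_1
alpha1_alone = is_estimable(X, [0, 1, 0, 0, 0, 0])  # alpha_1 by itself
```

The four functions listed at the end of Section 12.8.1 pass the check, while $\alpha_1$ alone does not, in line with the remark that estimable functions not involving $\mu$ are contrasts.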


12.8.2 Testing a Hypothesis 


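As noted at the beginning of Section 12.8.1, $H_0\colon \alpha_1 = \alpha_2 = \alpha_3$ is equivalent to $H_0\colon \alpha_1 - \alpha_2 = \alpha_1 - \alpha_3 = 0$. Since two linearly independent estimable functions of the $\alpha$'s are needed to express $H_0\colon \alpha_1 = \alpha_2 = \alpha_3$ (see Theorems 12.7b and 12.7c), the sum of squares for testing $H_0\colon \alpha_1 = \alpha_2 = \alpha_3$ has 2 degrees of freedom. Similarly, $H_0\colon \beta_1 = \beta_2$ is testable with 1 degree of freedom.

The normal equations $\mathbf{X}'\mathbf{X}\hat{\boldsymbol\beta} = \mathbf{X}'\mathbf{y}$ are given by

$$\begin{pmatrix}
6&2&2&2&3&3\\
2&2&0&0&1&1\\
2&0&2&0&1&1\\
2&0&0&2&1&1\\
3&1&1&1&3&0\\
3&1&1&1&0&3
\end{pmatrix}
\begin{pmatrix} \hat\mu\\ \hat\alpha_1\\ \hat\alpha_2\\ \hat\alpha_3\\ \hat\beta_1\\ \hat\beta_2 \end{pmatrix}
= \begin{pmatrix} y_{..}\\ y_{1.}\\ y_{2.}\\ y_{3.}\\ y_{.1}\\ y_{.2} \end{pmatrix}. \tag{12.48}$$

If we impose the side conditions $\hat\alpha_1 + \hat\alpha_2 + \hat\alpha_3 = 0$ and $\hat\beta_1 + \hat\beta_2 = 0$, we obtain the following solution to the normal equations:

$$\begin{aligned}
\hat\mu &= \bar{y}_{..}, & \hat\alpha_1 &= \bar{y}_{1.} - \bar{y}_{..}, & \hat\alpha_2 &= \bar{y}_{2.} - \bar{y}_{..}, \\
\hat\alpha_3 &= \bar{y}_{3.} - \bar{y}_{..}, & \hat\beta_1 &= \bar{y}_{.1} - \bar{y}_{..}, & \hat\beta_2 &= \bar{y}_{.2} - \bar{y}_{..},
\end{aligned} \tag{12.49}$$

where $\bar{y}_{..} = \sum_{ij} y_{ij}/6$, $\bar{y}_{1.} = \sum_j y_{1j}/2$, and so on.

If we impose the side conditions on both the parameters and the estimates, equations (12.49) are unique estimates of unique meaningful parameters. Thus, for example, $\alpha_1$ becomes $\alpha_1^* = \bar\mu_{1.} - \bar\mu_{..}$, the expected deviation from the mean due to treatment 1 (see Section 12.1.1), and $\bar{y}_{1.} - \bar{y}_{..}$ is a reasonable estimate. On the other hand, if the side conditions are used only to obtain estimates and are not imposed on the parameters, then $\alpha_1$ is not unique, and $\bar{y}_{1.} - \bar{y}_{..}$ does not estimate a parameter. In this case, $\hat\alpha_1 = \bar{y}_{1.} - \bar{y}_{..}$ can be used only together with other elements in $\hat{\boldsymbol\beta}$ [as given by (12.49)] to obtain estimates $\boldsymbol\lambda'\hat{\boldsymbol\beta}$ of estimable functions $\boldsymbol\lambda'\boldsymbol\beta$.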




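We now proceed to obtain the test for $H_0\colon \alpha_1 = \alpha_2 = \alpha_3$ following the outline in Table 12.3. First, for the full model, we need $\hat{\boldsymbol\beta}'\mathbf{X}'\mathbf{y} = SS(\mu, \alpha_1, \alpha_2, \alpha_3, \beta_1, \beta_2)$, which we denote by $SS(\mu, \alpha, \beta)$. By (12.48) and (12.49), we obtain

$$\begin{aligned}
SS(\mu, \alpha, \beta) = \hat{\boldsymbol\beta}'\mathbf{X}'\mathbf{y} &= (\hat\mu, \hat\alpha_1, \ldots, \hat\beta_2)\begin{pmatrix} y_{..}\\ y_{1.}\\ \vdots\\ y_{.2} \end{pmatrix} \\
&= \hat\mu y_{..} + \hat\alpha_1 y_{1.} + \hat\alpha_2 y_{2.} + \hat\alpha_3 y_{3.} + \hat\beta_1 y_{.1} + \hat\beta_2 y_{.2} \\
&= \bar{y}_{..}y_{..} + \sum_{i=1}^{3}(\bar{y}_{i.} - \bar{y}_{..})y_{i.} + \sum_{j=1}^{2}(\bar{y}_{.j} - \bar{y}_{..})y_{.j} \\
&= \frac{y_{..}^2}{6} + \left(\sum_{i=1}^{3}\frac{y_{i.}^2}{2} - \frac{y_{..}^2}{6}\right) + \left(\sum_{j=1}^{2}\frac{y_{.j}^2}{3} - \frac{y_{..}^2}{6}\right),
\end{aligned} \tag{12.50}$$

since $\sum_i y_{i.} = y_{..}$ and $\sum_j y_{.j} = y_{..}$. The error sum of squares SSE is given by

$$\mathbf{y}'\mathbf{y} - \hat{\boldsymbol\beta}'\mathbf{X}'\mathbf{y} = \sum_{ij} y_{ij}^2 - \frac{y_{..}^2}{6} - \left(\sum_{i=1}^{3}\frac{y_{i.}^2}{2} - \frac{y_{..}^2}{6}\right) - \left(\sum_{j=1}^{2}\frac{y_{.j}^2}{3} - \frac{y_{..}^2}{6}\right).$$

To obtain $\hat{\boldsymbol\beta}_2'\mathbf{X}_2'\mathbf{y}$ in Table 12.3, we use the reduced model $y_{ij} = \mu + \alpha + \beta_j + \varepsilon_{ij} = \mu + \beta_j + \varepsilon_{ij}$, where $\alpha_1 = \alpha_2 = \alpha_3 = \alpha$ and $\mu + \alpha$ is replaced by $\mu$. The normal equations $\mathbf{X}_2'\mathbf{X}_2\hat{\boldsymbol\beta}_2 = \mathbf{X}_2'\mathbf{y}$ for the reduced model are

$$\begin{aligned}
6\hat\mu + 3\hat\beta_1 + 3\hat\beta_2 &= y_{..} \\
3\hat\mu + 3\hat\beta_1 &= y_{.1} \\
3\hat\mu + 3\hat\beta_2 &= y_{.2}.
\end{aligned} \tag{12.51}$$

Using the side condition $\hat\beta_1 + \hat\beta_2 = 0$, the solution to the reduced normal equations in (12.51) is easily obtained as

$$\hat\mu = \bar{y}_{..}, \quad \hat\beta_1 = \bar{y}_{.1} - \bar{y}_{..}, \quad \hat\beta_2 = \bar{y}_{.2} - \bar{y}_{..}. \tag{12.52}$$

By (12.51) and (12.52), we have

$$SS(\mu, \beta) = \hat{\boldsymbol\beta}_2'\mathbf{X}_2'\mathbf{y} = \hat\mu y_{..} + \hat\beta_1 y_{.1} + \hat\beta_2 y_{.2} = \frac{y_{..}^2}{6} + \left(\sum_{j=1}^{2}\frac{y_{.j}^2}{3} - \frac{y_{..}^2}{6}\right). \tag{12.53}$$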


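Abbreviating $SS(\alpha_1, \alpha_2, \alpha_3|\mu, \beta_1, \beta_2)$ as $SS(\alpha|\mu, \beta)$, we have

$$SS(\alpha|\mu, \beta) = \hat{\boldsymbol\beta}'\mathbf{X}'\mathbf{y} - \hat{\boldsymbol\beta}_2'\mathbf{X}_2'\mathbf{y} = \sum_{i=1}^{3}\frac{y_{i.}^2}{2} - \frac{y_{..}^2}{6}. \tag{12.54}$$

The test is summarized in Table 12.4. [Note that $SS(\beta|\mu, \alpha)$ is not included.]


TABLE 12.4 ANOVA for Testing $H_0\colon \alpha_1 = \alpha_2 = \alpha_3$

Source of Variation                          df    Sum of Squares                                                              F Statistic
Due to $\alpha$ adjusted for $\mu, \beta$      2     $SS(\alpha|\mu, \beta) = \sum_i y_{i.}^2/2 - y_{..}^2/6$                      $\dfrac{SS(\alpha|\mu, \beta)/2}{\mathrm{SSE}/2}$
Error                                        2     $\mathrm{SSE} = \sum_{ij} y_{ij}^2 - \hat{\boldsymbol\beta}'\mathbf{X}'\mathbf{y}$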
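The entries of Table 12.4 depend only on row, column, and grand totals, so the full-reduced-model test can be evaluated with exact arithmetic. The sketch below is illustrative only: the data are made up, and the variable names are hypothetical labels for the quantities in (12.50), (12.53), and (12.54).

```python
from fractions import Fraction as F

# Hypothetical observations y_ij, rows i = 1, 2, 3 (alpha levels), columns j = 1, 2 (beta levels).
y = [[4, 6], [7, 9], [1, 5]]

row_tot = [sum(r) for r in y]         # y_1., y_2., y_3.
col_tot = [sum(c) for c in zip(*y)]   # y_.1, y_.2
grand = sum(row_tot)                  # y_..
n, k = 6, 4                           # number of observations and rank of X

ss_mu = F(grand ** 2, 6)
ss_alpha = sum(F(t ** 2, 2) for t in row_tot) - ss_mu
ss_beta = sum(F(t ** 2, 3) for t in col_tot) - ss_mu

ss_full = ss_mu + ss_alpha + ss_beta   # SS(mu, alpha, beta), equation (12.50)
ss_reduced = ss_mu + ss_beta           # SS(mu, beta), equation (12.53)
ss_alpha_adj = ss_full - ss_reduced    # SS(alpha | mu, beta), equation (12.54)

sse = sum(v ** 2 for r in y for v in r) - ss_full
f_stat = (ss_alpha_adj / 2) / (sse / (n - k))
```

With these made-up data $SS(\alpha|\mu, \beta) = 76/3$, $\mathrm{SSE} = 4/3$, and the $F$ statistic on $(2, 2)$ degrees of freedom is 19.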


12.8.3 Orthogonality of Columns of X 


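The estimates of $\mu$, $\beta_1$, and $\beta_2$ given in (12.52) for the reduced model are the same as those of $\mu$, $\beta_1$, and $\beta_2$ given in (12.49) for the full model. The sum of squares $\hat{\boldsymbol\beta}_2'\mathbf{X}_2'\mathbf{y}$ in (12.53) is clearly a part of $\hat{\boldsymbol\beta}'\mathbf{X}'\mathbf{y}$ in (12.50). In fact, (12.54) can be expressed as $SS(\alpha|\mu, \beta) = SS(\alpha)$, and (12.50) becomes $SS(\mu, \alpha, \beta) = SS(\mu) + SS(\alpha) + SS(\beta)$. These simplified results are due to the essential orthogonality in the $\mathbf{X}$ matrix in (12.47) as required by Theorem 12.7a. There are three groups of columns in the $\mathbf{X}$ matrix in (12.47), the first column corresponding to $\mu$, the next three columns corresponding to $\alpha_1$, $\alpha_2$, and $\alpha_3$, and the last two columns corresponding to $\beta_1$ and $\beta_2$. The columns of $\mathbf{X}$ in (12.47) are orthogonal within each group but not among groups as required by Theorem 12.7a. However, consider the same $\mathbf{X}$ matrix if each column after the first is centered using the mean of the column:

$$(\mathbf{j}, \mathbf{X}_c) = \begin{pmatrix}
1 & \tfrac23 & -\tfrac13 & -\tfrac13 & \tfrac12 & -\tfrac12 \\
1 & \tfrac23 & -\tfrac13 & -\tfrac13 & -\tfrac12 & \tfrac12 \\
1 & -\tfrac13 & \tfrac23 & -\tfrac13 & \tfrac12 & -\tfrac12 \\
1 & -\tfrac13 & \tfrac23 & -\tfrac13 & -\tfrac12 & \tfrac12 \\
1 & -\tfrac13 & -\tfrac13 & \tfrac23 & \tfrac12 & -\tfrac12 \\
1 & -\tfrac13 & -\tfrac13 & \tfrac23 & -\tfrac12 & \tfrac12
\end{pmatrix}. \tag{12.55}$$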


334 ANALYSIS-OF-VARIANCE MODELS 


Now the columns are orthogonal among the groups. For example, each of columns 2, 
3, and 4 is orthogonal to each of columns 5 and 6, but columns 2, 3, and 4 are not 
orthogonal to each other. Note that rank(j, X,.) = 4 since the sum of columns 2, 3, 
and 4 is 0 and the sum of columns 5 and 6 is 0. Thus rank(j, X,) is the same as 
the rank of X in (12.47). 
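The effect of centering can be verified directly. This sketch assumes the rows of $\mathbf{X}$ are ordered $y_{11}, y_{12}, y_{21}, y_{22}, y_{31}, y_{32}$ (an assumption, since (12.47) is not reproduced here); the variable names are illustrative.

```python
import numpy as np

# X for y_ij = mu + alpha_i + beta_j + e_ij, i = 1..3, j = 1..2.
# Columns: mu | alpha_1 alpha_2 alpha_3 | beta_1 beta_2 (row order is an assumption).
X = np.array([[1, 1, 0, 0, 1, 0],
              [1, 1, 0, 0, 0, 1],
              [1, 0, 1, 0, 1, 0],
              [1, 0, 1, 0, 0, 1],
              [1, 0, 0, 1, 1, 0],
              [1, 0, 0, 1, 0, 1]], dtype=float)

# Center every column after the first by its own mean.
Xc = X.copy()
Xc[:, 1:] -= X[:, 1:].mean(axis=0)

gram_raw = X.T @ X
gram_centered = Xc.T @ Xc

# Blocks of cross products between different groups of columns.
raw_alpha_beta = gram_raw[1:4, 4:]        # nonzero before centering
mu_alpha = gram_centered[0, 1:4]          # zero after centering
mu_beta = gram_centered[0, 4:]            # zero after centering
alpha_beta = gram_centered[1:4, 4:]       # zero after centering
alpha_alpha = gram_centered[1:4, 1:4]     # off-diagonals stay nonzero

rank_raw = np.linalg.matrix_rank(X)
rank_centered = np.linalg.matrix_rank(Xc)
print(rank_raw, rank_centered)
```

Centering removes the cross products between groups but leaves the columns within the $\alpha$ group correlated, and it does not change the rank.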

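We now illustrate the use of side conditions to obtain an orthogonalization that is full-rank (this was illustrated for a one-way model in Section 12.1.1). Consider the two-way model with interaction

$$
y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \varepsilon_{ijk}, \qquad i = 1, 2;\; j = 1, 2;\; k = 1, 2. \tag{12.56}
$$

In matrix form, the model is

$$
\begin{pmatrix} y_{111} \\ y_{112} \\ y_{121} \\ y_{122} \\ y_{211} \\ y_{212} \\ y_{221} \\ y_{222} \end{pmatrix}
=
\begin{pmatrix}
1 & 1 & 0 & 1 & 0 & 1 & 0 & 0 & 0 \\
1 & 1 & 0 & 1 & 0 & 1 & 0 & 0 & 0 \\
1 & 1 & 0 & 0 & 1 & 0 & 1 & 0 & 0 \\
1 & 1 & 0 & 0 & 1 & 0 & 1 & 0 & 0 \\
1 & 0 & 1 & 1 & 0 & 0 & 0 & 1 & 0 \\
1 & 0 & 1 & 1 & 0 & 0 & 0 & 1 & 0 \\
1 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 1 \\
1 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 1
\end{pmatrix}
\begin{pmatrix} \mu \\ \alpha_1 \\ \alpha_2 \\ \beta_1 \\ \beta_2 \\ \gamma_{11} \\ \gamma_{12} \\ \gamma_{21} \\ \gamma_{22} \end{pmatrix}
+
\begin{pmatrix} \varepsilon_{111} \\ \varepsilon_{112} \\ \varepsilon_{121} \\ \varepsilon_{122} \\ \varepsilon_{211} \\ \varepsilon_{212} \\ \varepsilon_{221} \\ \varepsilon_{222} \end{pmatrix}.
\tag{12.57}
$$

Useful side conditions become apparent in the context of the normal equations, which are given by

$$
\begin{aligned}
8\hat{\mu} + 4(\hat{\alpha}_1 + \hat{\alpha}_2) + 4(\hat{\beta}_1 + \hat{\beta}_2) + 2(\hat{\gamma}_{11} + \hat{\gamma}_{12} + \hat{\gamma}_{21} + \hat{\gamma}_{22}) &= y_{...}, \\
4\hat{\mu} + 4\hat{\alpha}_i + 2(\hat{\beta}_1 + \hat{\beta}_2) + 2(\hat{\gamma}_{i1} + \hat{\gamma}_{i2}) &= y_{i..}, \quad i = 1, 2, \\
4\hat{\mu} + 2(\hat{\alpha}_1 + \hat{\alpha}_2) + 4\hat{\beta}_j + 2(\hat{\gamma}_{1j} + \hat{\gamma}_{2j}) &= y_{.j.}, \quad j = 1, 2, \\
2\hat{\mu} + 2\hat{\alpha}_i + 2\hat{\beta}_j + 2\hat{\gamma}_{ij} &= y_{ij.}, \quad i = 1, 2,\; j = 1, 2.
\end{aligned}
\tag{12.58}
$$

Solution of the equations in (12.58) would be simplified by the following side conditions:

$$
\begin{aligned}
\hat{\alpha}_1 + \hat{\alpha}_2 &= 0, \qquad \hat{\beta}_1 + \hat{\beta}_2 = 0, \\
\hat{\gamma}_{i1} + \hat{\gamma}_{i2} &= 0, \quad i = 1, 2, \\
\hat{\gamma}_{1j} + \hat{\gamma}_{2j} &= 0, \quad j = 1, 2.
\end{aligned}
\tag{12.59}
$$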


In (12.57), the $\mathbf{X}$ matrix is $8 \times 9$ of rank 4 since the first five columns are all expressible as linear combinations of the last four columns, which are linearly independent. Thus $\mathbf{X}'\mathbf{X}$ is $9 \times 9$ and has a rank deficiency of $9 - 4 = 5$. However, there are six side conditions in (12.59). This apparent discrepancy is resolved by noting that there are only three restrictions among the last four equations in (12.59). We can obtain any one of these four from the other three. To illustrate, we obtain the first equation from the last three. Adding the third and fourth equations gives $\hat{\gamma}_{11} + \hat{\gamma}_{21} + \hat{\gamma}_{12} + \hat{\gamma}_{22} = 0$. Then substitution of the second, $\hat{\gamma}_{21} + \hat{\gamma}_{22} = 0$, reduces this to the first, $\hat{\gamma}_{11} + \hat{\gamma}_{12} = 0$.

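We can obtain a full-rank orthogonalization by imposing the side conditions in (12.59) on the parameters and using these relationships to express redundant parameters in terms of the four parameters $\mu$, $\alpha_1$, $\beta_1$, and $\gamma_{11}$. (For expositional convenience, we do not use * on the parameters subject to side conditions.) This gives

$$
\alpha_2 = -\alpha_1, \quad \beta_2 = -\beta_1, \quad
\gamma_{12} = -\gamma_{11}, \quad \gamma_{21} = -\gamma_{11}, \quad \gamma_{22} = \gamma_{11}. \tag{12.60}
$$

The last of these, for example, is obtained from the side condition $\gamma_{12} + \gamma_{22} = 0$. Thus $\gamma_{22} = -\gamma_{12} = -(-\gamma_{11})$.

Using (12.60), we can express the eight $y_{ijk}$ values in (12.56) in terms of $\mu$, $\alpha_1$, $\beta_1$, and $\gamma_{11}$:

$$
\begin{aligned}
y_{11k} &= \mu + \alpha_1 + \beta_1 + \gamma_{11} + \varepsilon_{11k}, & k &= 1, 2, \\
y_{12k} &= \mu + \alpha_1 + \beta_2 + \gamma_{12} + \varepsilon_{12k} = \mu + \alpha_1 - \beta_1 - \gamma_{11} + \varepsilon_{12k}, & k &= 1, 2, \\
y_{21k} &= \mu + \alpha_2 + \beta_1 + \gamma_{21} + \varepsilon_{21k} = \mu - \alpha_1 + \beta_1 - \gamma_{11} + \varepsilon_{21k}, & k &= 1, 2, \\
y_{22k} &= \mu + \alpha_2 + \beta_2 + \gamma_{22} + \varepsilon_{22k} = \mu - \alpha_1 - \beta_1 + \gamma_{11} + \varepsilon_{22k}, & k &= 1, 2.
\end{aligned}
$$

The redefined $\mathbf{X}$ matrix thus becomes

$$
\begin{pmatrix}
1 & 1 & 1 & 1 \\
1 & 1 & 1 & 1 \\
1 & 1 & -1 & -1 \\
1 & 1 & -1 & -1 \\
1 & -1 & 1 & -1 \\
1 & -1 & 1 & -1 \\
1 & -1 & -1 & 1 \\
1 & -1 & -1 & 1
\end{pmatrix},
$$

which is a full-rank matrix with orthogonal columns. The methods of Chapters 7 and 8 can now be used for estimation and testing hypotheses.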
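The rank and orthogonality claims for this $2 \times 2$ layout with two replicates can be confirmed numerically. The sketch below builds both design matrices from the model definition; the names `cells`, `X`, and `X_star` are illustrative only.

```python
import numpy as np

# Cell indices (i, j), each repeated for k = 1, 2, in the order y111, y112, y121, ..., y222.
cells = [(i, j) for i in (0, 1) for j in (0, 1) for _ in range(2)]

# Overparameterized 8 x 9 matrix with columns
# mu | alpha_1 alpha_2 | beta_1 beta_2 | gamma_11 gamma_12 gamma_21 gamma_22.
X = np.zeros((8, 9))
for r, (i, j) in enumerate(cells):
    X[r, 0] = 1.0
    X[r, 1 + i] = 1.0
    X[r, 3 + j] = 1.0
    X[r, 5 + 2 * i + j] = 1.0

# Redefined 8 x 4 matrix in terms of mu, alpha_1, beta_1, gamma_11 after the side conditions.
sign = {0: 1.0, 1: -1.0}
X_star = np.array([[1.0, sign[i], sign[j], sign[i] * sign[j]] for (i, j) in cells])

rank_X = np.linalg.matrix_rank(X)
rank_star = np.linalg.matrix_rank(X_star)
gram = X_star.T @ X_star                                # diagonal when columns are orthogonal
rank_joint = np.linalg.matrix_rank(np.hstack([X, X_star]))  # same column space if still 4
print(rank_X, rank_star)
print(gram)
```

Both matrices have rank 4 and span the same column space, but only the redefined matrix has a diagonal $\mathbf{X}'\mathbf{X}$.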




PROBLEMS 
12.1 Show that fy, + fly, = 21, as in (12.9). 
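12.2 Show that $\hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}}$ in (12.10) is minimized by $\hat{\boldsymbol{\beta}}$, the solution to $\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}'\mathbf{y}$ in (12.11).
12.3 Use Theorem 2.7 to prove Theorem 12.2a.
12.4 (a) Give an alternative proof of Theorem 12.2b(iii) based on Theorem 2.8c(iii).
(b) Give a second alternative proof of Theorem 12.2b(iii) based on Theorem 2.8f.
12.5 (a) Using all three conditions in Theorem 12.2b, show that $\boldsymbol{\lambda}'\boldsymbol{\beta} = \mu + \tau_2 = (1, 0, 1)\boldsymbol{\beta}$ is estimable (use the model in Example 12.2.2a).
(b) Using all three conditions in Theorem 12.2b, show that $\boldsymbol{\lambda}'\boldsymbol{\beta} = \tau_1 + \tau_2 = (0, 1, 1)\boldsymbol{\beta}$ is not estimable.
12.6 If $\boldsymbol{\lambda}'\boldsymbol{\beta}$ is estimable and $\hat{\boldsymbol{\beta}}_1$ and $\hat{\boldsymbol{\beta}}_2$ are two solutions to the normal equations, show that $\boldsymbol{\lambda}'\hat{\boldsymbol{\beta}}_1 = \boldsymbol{\lambda}'\hat{\boldsymbol{\beta}}_2$ as in Theorem 12.3a(iii).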
12.7 Obtain an estimate of + 7 using r’X’y and NB from the model in Example 
12.3.1. 
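12.8 Consider the model $y_{ij} = \mu + \tau_i + \varepsilon_{ij}$, $i = 1, 2$, $j = 1, 2, 3$:
(a) For $\boldsymbol{\lambda}'\boldsymbol{\beta} = (1, 1, 0)\boldsymbol{\beta} = \mu + \tau_1$, show that
$$
\mathbf{r} = c \begin{pmatrix} -1 \\ 1 \\ 1 \end{pmatrix} + \begin{pmatrix} 0 \\ \tfrac{1}{3} \\ 0 \end{pmatrix},
$$
with arbitrary $c$, represents all solutions to $\mathbf{X}'\mathbf{X}\mathbf{r} = \boldsymbol{\lambda}$.
(b) Obtain the BLUE [best linear unbiased estimator] for $\mu + \tau_1$ using $\mathbf{r}$ obtained in part (a).
(c) Find the BLUE for $\tau_1 - \tau_2$ using the method of parts (a) and (b).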
12.9 (a) In Example 12.2.2b, we found the estimable functions 


Ai B= w+ai+B,, 4B = B, — Bo, and A,B =a, — ay. Find the 
BLUE for each of these using r’X’y in each case. 
(b) For each estimator in part (a), show that $E(\mathbf{r}_i'\mathbf{X}'\mathbf{y}) = \boldsymbol{\lambda}_i'\boldsymbol{\beta}$.


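12.10 In the model $y_{ij} = \mu + \tau_i + \varepsilon_{ij}$, $i = 1, 2, \ldots, k$; $j = 1, 2, \ldots, n$, show that $\sum_{i=1}^{k} c_i\tau_i$ is estimable if and only if $\sum_{i=1}^{k} c_i = 0$, as suggested following Example 12.2.2b. Use the following two approaches:
(a) In $\boldsymbol{\lambda}'\boldsymbol{\beta} = \sum_{i=1}^{k} c_i\tau_i$, express $\boldsymbol{\lambda}'$ as a linear combination of the rows of $\mathbf{X}$.
(b) Express $\sum_{i=1}^{k} c_i\tau_i$ as a linear combination of the elements of $E(\mathbf{y}) = \mathbf{X}\boldsymbol{\beta}$.

12.11 In Example 12.3.1, find all solutions $\mathbf{r}$ for $\mathbf{X}'\mathbf{X}\mathbf{r} = \boldsymbol{\lambda}$ and show that all of them give $\mathbf{r}'\mathbf{X}'\mathbf{y} = \bar{y}_{1.} - \bar{y}_{2.}$.

12.12 Show that $\operatorname{cov}(\boldsymbol{\lambda}_1'\hat{\boldsymbol{\beta}}, \boldsymbol{\lambda}_2'\hat{\boldsymbol{\beta}}) = \sigma^2\mathbf{r}_1'\boldsymbol{\lambda}_2 = \sigma^2\boldsymbol{\lambda}_1'\mathbf{r}_2 = \sigma^2\boldsymbol{\lambda}_1'(\mathbf{X}'\mathbf{X})^{-}\boldsymbol{\lambda}_2$ as in Theorem 12.3c.

12.13 (a) Show that $(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = \mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y}$ as in (12.20).
(b) Show that $\mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} = \mathbf{y}'[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}']\mathbf{y}$ as in (12.21).

12.14 Show that $\boldsymbol{\beta}'\mathbf{X}'[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}']\mathbf{X}\boldsymbol{\beta} = 0$, as in the proof of Theorem 12.3e(i).

12.15 Differentiate $\ln L(\boldsymbol{\beta}, \sigma^2)$ in (12.26) with respect to $\boldsymbol{\beta}$ and $\sigma^2$ to obtain (12.27) and (12.28).

12.16 Prove Theorem 12.3g.

12.17 Show that $\boldsymbol{\lambda}'\boldsymbol{\beta} = \mathbf{b}'\boldsymbol{\gamma} = \mathbf{c}'\boldsymbol{\delta}$ as in (12.34).

12.18 Show that the matrix $\mathbf{Z}$ in Example 12.5 can be obtained using (12.31), $\mathbf{Z} = \mathbf{X}\mathbf{U}'(\mathbf{U}\mathbf{U}')^{-1}$.

12.19 Redo Example 12.5 with the parameterization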

_(erT 
aa & 1-7 i 
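Find $\mathbf{Z}$ and $\mathbf{U}$ by inspection and show that $\mathbf{Z}\mathbf{U} = \mathbf{X}$. Then show that $\mathbf{Z}$ can be obtained as $\mathbf{Z} = \mathbf{X}\mathbf{U}'(\mathbf{U}\mathbf{U}')^{-1}$.

12.20 Show that $\hat{\boldsymbol{\beta}}$ in (12.39) is a solution to the normal equations $\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}'\mathbf{y}$.

12.21 Show that $\begin{pmatrix} \alpha_1 - \alpha_2 \\ \alpha_1 + \alpha_2 - 2\alpha_3 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$ in (12.40) implies $\alpha_1 = \alpha_2 = \alpha_3$, as noted preceding (12.40).

12.22 Prove Theorem 12.7b(v).

12.23 Multiply $\mathbf{X}'\mathbf{X}$ in (12.48) by $\hat{\boldsymbol{\beta}}$ to obtain the six normal equations. Show that with the side conditions $\hat{\alpha}_1 + \hat{\alpha}_2 + \hat{\alpha}_3 = 0$ and $\hat{\beta}_1 + \hat{\beta}_2 = 0$, the solution is given by (12.49).

12.24 Obtain the reduced normal equations $\mathbf{X}_2'\mathbf{X}_2\hat{\boldsymbol{\beta}}_2 = \mathbf{X}_2'\mathbf{y}$ in (12.51) by writing $\mathbf{X}_2$ and $\mathbf{X}_2'\mathbf{X}_2$ for the reduced model $y_{ij} = \mu + \beta_j + \varepsilon_{ij}$, $i = 1, 2, 3$, $j = 1, 2$.

12.25 Consider the model $y_{ij} = \mu + \tau_i + \varepsilon_{ij}$, $i = 1, 2, 3$, $j = 1, 2, 3$:
(a) Write $\mathbf{X}$, $\mathbf{X}'\mathbf{X}$, $\mathbf{X}'\mathbf{y}$, and the normal equations.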


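(b) What is the rank of $\mathbf{X}$ or $\mathbf{X}'\mathbf{X}$? Find a set of linearly independent estimable functions.
(c) Define an appropriate side condition, and find the resulting solution to the normal equations.
(d) Show that $H_0\colon \tau_1 = \tau_2 = \tau_3$ is testable. Find $\hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} = \mathrm{SS}(\mu, \tau)$ and $\hat{\boldsymbol{\beta}}_2'\mathbf{X}_2'\mathbf{y} = \mathrm{SS}(\mu)$.
(e) Construct an ANOVA table for the test of $H_0\colon \tau_1 = \tau_2 = \tau_3$.

12.26 Consider the model $y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \varepsilon_{ijk}$, $i = 1, 2$, $j = 1, 2$, $k = 1, 2, 3$.
(a) Write $\mathbf{X}'\mathbf{X}$, $\mathbf{X}'\mathbf{y}$, and the normal equations.
(b) Find a set of linearly independent estimable functions. Are $\alpha_1 - \alpha_2$ and $\beta_1 - \beta_2$ estimable?

12.27 Consider the model $y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_k + \varepsilon_{ijk}$, $i = 1, 2$, $j = 1, 2$, $k = 1, 2$.
(a) Write $\mathbf{X}'\mathbf{X}$, $\mathbf{X}'\mathbf{y}$, and the normal equations.
(b) Find a set of linearly independent estimable functions.
(c) Define appropriate side conditions, and find the resulting solution to the normal equations.
(d) Show that $H_0\colon \alpha_1 = \alpha_2$ is testable. Find $\hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} = \mathrm{SS}(\mu, \alpha, \beta, \gamma)$ and $\hat{\boldsymbol{\beta}}_2'\mathbf{X}_2'\mathbf{y} = \mathrm{SS}(\mu, \beta, \gamma)$.
(e) Construct an ANOVA table for the test of $H_0\colon \alpha_1 = \alpha_2$.

12.28 For the model $y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \varepsilon_{ijk}$, $i = 1, 2$, $j = 1, 2$, $k = 1, 2$ in (12.56), write $\mathbf{X}'\mathbf{X}$ and obtain the normal equations in (12.58).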


13 One-Way Analysis-of-Variance: 
Balanced Case 


The one-way analysis-of-variance (ANOVA) model has been illustrated in Sections 
12.1.1, 12.2.2, 12.3.1, 12.5, and 12.6. We now analyze this model more fully. To 
solve the normal equations in Section 13.3, we use side conditions as well as a gen- 
eralized inverse approach. For hypothesis tests in Section 13.4, we use both the full— 
reduced-model approach and the general linear hypothesis. Expected mean squares 
are obtained in Section 13.5 using both a full—reduced-model approach and a 
general linear hypothesis approach. In Section 13.6, we discuss contrasts on the 
means, including orthogonal polynomials. Throughout this chapter, we consider 
only the balanced model. The unbalanced case is discussed in Chapter 15. 


13.1 THE ONE-WAY MODEL 


The one-way balanced model can be expressed as follows:

$$
y_{ij} = \mu + \alpha_i + \varepsilon_{ij}, \qquad i = 1, 2, \ldots, k, \quad j = 1, 2, \ldots, n. \tag{13.1}
$$

If $\alpha_1, \alpha_2, \ldots, \alpha_k$ represent the effects of $k$ treatments, each of which is applied to $n$ experimental units, then $y_{ij}$ is the response of the $j$th observation among the $n$ units that receive the $i$th treatment. For example, in an agricultural experiment, the treatments may be different fertilizers or different amounts of a given fertilizer. On the other hand, in some experimental situations, the $k$ groups may represent samples from $k$ populations whose means we wish to compare, populations that are not created by applying treatments. For example, suppose that we wish to compare the average lifetimes of several brands of batteries or the mean grade-point averages for freshmen, sophomores, juniors, and seniors. Three additional assumptions that form part of the model in (13.1) are


1. $E(\varepsilon_{ij}) = 0$ for all $i, j$.
2. $\operatorname{var}(\varepsilon_{ij}) = \sigma^2$ for all $i, j$.
3. $\operatorname{cov}(\varepsilon_{ij}, \varepsilon_{rs}) = 0$ for all $(i, j) \neq (r, s)$.
4. We sometimes add the assumption that $\varepsilon_{ij}$ is distributed as $N(0, \sigma^2)$.
5. In addition, we often use the constraint (side condition) $\sum_{i=1}^{k} \alpha_i = 0$.


The mean for the $i$th treatment or population can be denoted by $\mu_i$. Thus $E(y_{ij}) = \mu_i$, and using assumption 1, we have $\mu_i = \mu + \alpha_i$. We can thus write (13.1) in the form

$$
y_{ij} = \mu_i + \varepsilon_{ij}, \qquad i = 1, 2, \ldots, k, \quad j = 1, 2, \ldots, n. \tag{13.2}
$$

In this form of the model, the hypothesis $H_0\colon \mu_1 = \mu_2 = \cdots = \mu_k$ is of interest.
In the context of design of experiments, the one-way layout is sometimes called a 


completely randomized design. In this design, the experimental units are assigned at 
random to the k treatments. 


13.2. ESTIMABLE FUNCTIONS 


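To illustrate the model (13.1) in matrix form, let $k = 3$ and $n = 2$. The resulting six equations, $y_{ij} = \mu + \alpha_i + \varepsilon_{ij}$, $i = 1, 2, 3$, $j = 1, 2$, can be expressed as

$$
\begin{pmatrix} y_{11} \\ y_{12} \\ y_{21} \\ y_{22} \\ y_{31} \\ y_{32} \end{pmatrix}
=
\begin{pmatrix}
\mu + \alpha_1 + \varepsilon_{11} \\ \mu + \alpha_1 + \varepsilon_{12} \\ \mu + \alpha_2 + \varepsilon_{21} \\ \mu + \alpha_2 + \varepsilon_{22} \\ \mu + \alpha_3 + \varepsilon_{31} \\ \mu + \alpha_3 + \varepsilon_{32}
\end{pmatrix}
=
\begin{pmatrix}
1 & 1 & 0 & 0 \\
1 & 1 & 0 & 0 \\
1 & 0 & 1 & 0 \\
1 & 0 & 1 & 0 \\
1 & 0 & 0 & 1 \\
1 & 0 & 0 & 1
\end{pmatrix}
\begin{pmatrix} \mu \\ \alpha_1 \\ \alpha_2 \\ \alpha_3 \end{pmatrix}
+
\begin{pmatrix} \varepsilon_{11} \\ \varepsilon_{12} \\ \varepsilon_{21} \\ \varepsilon_{22} \\ \varepsilon_{31} \\ \varepsilon_{32} \end{pmatrix}
\tag{13.3}
$$

or

$$
\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}.
$$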


In (13.3), $\mathbf{X}$ is $6 \times 4$ and is clearly of rank 3 because the first column is the sum of the other three columns. Thus $\boldsymbol{\beta} = (\mu, \alpha_1, \alpha_2, \alpha_3)'$ is not unique and not estimable; hence the individual parameters $\mu, \alpha_1, \alpha_2, \alpha_3$ cannot be estimated unless they are subject to constraints (side conditions). In general, the $\mathbf{X}$ matrix for the one-way balanced model is $kn \times (k + 1)$ of rank $k$.

We discussed estimable functions $\boldsymbol{\lambda}'\boldsymbol{\beta}$ in Section 12.2.2. It was shown in Problem 12.10 that for the one-way balanced model, contrasts in the $\alpha$'s are estimable. Thus $\sum_i c_i\alpha_i$ is estimable if and only if $\sum_i c_i = 0$. For example, contrasts such as $\alpha_1 - \alpha_2$ and $\alpha_1 - 2\alpha_2 + \alpha_3$ are estimable.
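Estimability can be checked numerically by testing whether $\boldsymbol{\lambda}'$ lies in the row space of $\mathbf{X}$, which is the idea behind approach (a) of Problem 12.10. The sketch below is illustrative: `one_way_design` and `is_estimable` are hypothetical helper names, and the rank comparison relies on NumPy's default numerical tolerance.

```python
import numpy as np


def one_way_design(k, n):
    """kn x (k+1) matrix: a column of ones followed by k treatment indicator columns."""
    return np.hstack([np.ones((k * n, 1)), np.kron(np.eye(k), np.ones((n, 1)))])


def is_estimable(X, lam):
    """lambda' beta is estimable when appending lambda' to X does not raise the rank."""
    lam = np.atleast_2d(lam)
    return bool(np.linalg.matrix_rank(np.vstack([X, lam])) == np.linalg.matrix_rank(X))


X = one_way_design(3, 2)          # the k = 3, n = 2 matrix of (13.3)
rank_X = np.linalg.matrix_rank(X)

checks = {
    "alpha1 - alpha2": is_estimable(X, [0, 1, -1, 0]),
    "alpha1 - 2 alpha2 + alpha3": is_estimable(X, [0, 1, -2, 1]),
    "mu + alpha1": is_estimable(X, [1, 1, 0, 0]),
    "alpha1 alone": is_estimable(X, [0, 1, 0, 0]),
    "mu alone": is_estimable(X, [1, 0, 0, 0]),
}
for name, ok in checks.items():
    print(f"{name}: {'estimable' if ok else 'not estimable'}")
```

Contrasts and cell means such as $\mu + \alpha_1$ pass the check, while the individual parameters do not.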

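If we impose a side condition on the $\alpha_i$'s and denote the constrained parameters as $\mu^*$ and $\alpha_i^*$, then $\mu^*, \alpha_1^*, \ldots, \alpha_k^*$ are uniquely defined and estimable. Under the usual side condition, $\sum_{i=1}^{k} \alpha_i^* = 0$, the parameters are defined as $\mu^* = \bar{\mu}_.$ and $\alpha_i^* = \mu_i - \bar{\mu}_.$, where $\bar{\mu}_. = \sum_{i=1}^{k} \mu_i / k$. To see this, we rewrite (13.1) and (13.2) in the form $E(y_{ij}) = \mu_i = \mu^* + \alpha_i^*$ to obtain

$$
\bar{\mu}_. = \sum_{i=1}^{k} \frac{\mu_i}{k} = \mu^* + \sum_{i=1}^{k} \frac{\alpha_i^*}{k} = \mu^*. \tag{13.4}
$$

Then, from $\mu_i = \mu^* + \alpha_i^*$, we have

$$
\alpha_i^* = \mu_i - \mu^* = \mu_i - \bar{\mu}_.. \tag{13.5}
$$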

13.3. ESTIMATION OF PARAMETERS 


13.3.1 Solving the Normal Equations 


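Extending (13.3) to a general $k$ and $n$, the one-way model can be written in matrix form as

$$
\begin{pmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \\ \vdots \\ \mathbf{y}_k \end{pmatrix}
=
\begin{pmatrix}
\mathbf{j} & \mathbf{j} & \mathbf{0} & \cdots & \mathbf{0} \\
\mathbf{j} & \mathbf{0} & \mathbf{j} & \cdots & \mathbf{0} \\
\vdots & \vdots & \vdots & & \vdots \\
\mathbf{j} & \mathbf{0} & \mathbf{0} & \cdots & \mathbf{j}
\end{pmatrix}
\begin{pmatrix} \mu \\ \alpha_1 \\ \vdots \\ \alpha_k \end{pmatrix}
+
\begin{pmatrix} \boldsymbol{\varepsilon}_1 \\ \boldsymbol{\varepsilon}_2 \\ \vdots \\ \boldsymbol{\varepsilon}_k \end{pmatrix},
\tag{13.6}
$$

or

$$
\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon},
$$

where $\mathbf{j}$ and $\mathbf{0}$ are each of size $n \times 1$, and $\mathbf{y}_i$ and $\boldsymbol{\varepsilon}_i$ are defined as

$$
\mathbf{y}_i = \begin{pmatrix} y_{i1} \\ y_{i2} \\ \vdots \\ y_{in} \end{pmatrix}, \qquad
\boldsymbol{\varepsilon}_i = \begin{pmatrix} \varepsilon_{i1} \\ \varepsilon_{i2} \\ \vdots \\ \varepsilon_{in} \end{pmatrix}.
$$

The normal equations $\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}'\mathbf{y}$ are

$$
\begin{pmatrix}
kn & n & n & \cdots & n \\
n & n & 0 & \cdots & 0 \\
n & 0 & n & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots \\
n & 0 & 0 & \cdots & n
\end{pmatrix}
\begin{pmatrix} \hat{\mu} \\ \hat{\alpha}_1 \\ \hat{\alpha}_2 \\ \vdots \\ \hat{\alpha}_k \end{pmatrix}
=
\begin{pmatrix} y_{..} \\ y_{1.} \\ y_{2.} \\ \vdots \\ y_{k.} \end{pmatrix},
\tag{13.7}
$$

where $y_{..} = \sum_{ij} y_{ij}$ and $y_{i.} = \sum_j y_{ij}$.
In Section 13.3.1.1, we find a solution of (13.7) using side conditions, and in Section 13.3.1.2 we find another solution using a generalized inverse of $\mathbf{X}'\mathbf{X}$.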


13.3.1.1 Side Conditions 
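The $k + 1$ normal equations in (13.7) can be expressed as

$$
\begin{aligned}
kn\hat{\mu} + n\hat{\alpha}_1 + n\hat{\alpha}_2 + \cdots + n\hat{\alpha}_k &= y_{..}, \\
n\hat{\mu} + n\hat{\alpha}_i &= y_{i.}, \quad i = 1, 2, \ldots, k.
\end{aligned}
\tag{13.8}
$$

Using the side condition $\sum_i \hat{\alpha}_i = 0$, the solution to (13.8) is given by

$$
\hat{\mu} = \frac{y_{..}}{kn} = \bar{y}_{..}, \qquad
\hat{\alpha}_i = \frac{y_{i.}}{n} - \hat{\mu} = \bar{y}_{i.} - \bar{y}_{..}, \quad i = 1, 2, \ldots, k. \tag{13.9}
$$

In vector form, this solution $\hat{\boldsymbol{\beta}}$ for $\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}'\mathbf{y}$ is expressed as

$$
\hat{\boldsymbol{\beta}} = \begin{pmatrix} \bar{y}_{..} \\ \bar{y}_{1.} - \bar{y}_{..} \\ \vdots \\ \bar{y}_{k.} - \bar{y}_{..} \end{pmatrix}. \tag{13.10}
$$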


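If the side condition $\sum_i \alpha_i^* = 0$ is imposed on the parameters, then the elements of $\hat{\boldsymbol{\beta}}$ are unique estimators of the (constrained) parameters $\mu^* = \bar{\mu}_.$ and $\alpha_i^* = \mu_i - \bar{\mu}_.$, $i = 1, 2, \ldots, k$, in (13.4) and (13.5). Otherwise, the estimators in (13.9) or (13.10) are to be used in estimable functions. For example, by Theorem 12.3a(i), the estimator of $\boldsymbol{\lambda}'\boldsymbol{\beta} = \alpha_1 - \alpha_2$ is given by $\boldsymbol{\lambda}'\hat{\boldsymbol{\beta}}$:

$$
\boldsymbol{\lambda}'\hat{\boldsymbol{\beta}} = \hat{\alpha}_1 - \hat{\alpha}_2 = \bar{y}_{1.} - \bar{y}_{..} - (\bar{y}_{2.} - \bar{y}_{..}) = \bar{y}_{1.} - \bar{y}_{2.}.
$$

By Theorem 12.3d, such estimators are BLUE. If $\varepsilon_{ij}$ is $N(0, \sigma^2)$, then, by Theorem 12.3h, the estimators are minimum variance unbiased estimators.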


13.3.1.2 Generalized Inverse 
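By Corollary 1 to Theorem 2.8b, a generalized inverse of $\mathbf{X}'\mathbf{X}$ in (13.7) is given by

$$
(\mathbf{X}'\mathbf{X})^{-} =
\begin{pmatrix}
0 & 0 & \cdots & 0 \\
0 & \tfrac{1}{n} & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
0 & 0 & \cdots & \tfrac{1}{n}
\end{pmatrix}.
\tag{13.11}
$$

Then by (12.13) and (13.7), a solution to the normal equations is obtained as

$$
\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{y} = \begin{pmatrix} 0 \\ \bar{y}_{1.} \\ \vdots \\ \bar{y}_{k.} \end{pmatrix}. \tag{13.12}
$$

The estimators in (13.12) are different from those in (13.10), but they give the same estimates of estimable functions. For example, using $\hat{\boldsymbol{\beta}}$ from (13.12) to estimate $\boldsymbol{\lambda}'\boldsymbol{\beta} = \alpha_1 - \alpha_2$, we have

$$
\boldsymbol{\lambda}'\hat{\boldsymbol{\beta}} = \hat{\alpha}_1 - \hat{\alpha}_2 = \bar{y}_{1.} - \bar{y}_{2.},
$$

which is the same estimate as that obtained above in Section 13.3.1.1 using $\hat{\boldsymbol{\beta}}$ from (13.10).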
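A short numerical comparison of the two solutions may help. This is a sketch with hypothetical data (`y` below is invented for illustration); it builds the side-condition solution (13.10) and the generalized-inverse solution (13.12), confirms that both satisfy the normal equations, and compares their estimates of $\alpha_1 - \alpha_2$.

```python
import numpy as np

# Hypothetical balanced one-way data: k = 3 treatments (rows), n = 2 observations each.
y = np.array([[3.0, 5.0],
              [6.0, 8.0],
              [10.0, 12.0]])
k, n = y.shape
yvec = y.ravel()
X = np.hstack([np.ones((k * n, 1)), np.kron(np.eye(k), np.ones((n, 1)))])

ybar_i = y.mean(axis=1)   # treatment means
ybar = y.mean()           # grand mean

# Solution under the side condition sum(alpha_hat) = 0, as in (13.10).
beta_side = np.concatenate([[ybar], ybar_i - ybar])

# Solution from the generalized inverse diag(0, 1/n, ..., 1/n), as in (13.11)-(13.12).
G = np.diag([0.0] + [1.0 / n] * k)
beta_ginv = G @ X.T @ yvec

lam = np.array([0.0, 1.0, -1.0, 0.0])   # lambda' beta = alpha_1 - alpha_2
est_side = float(lam @ beta_side)
est_ginv = float(lam @ beta_ginv)
print(beta_side, beta_ginv, est_side, est_ginv)
```

The two solution vectors differ, yet both reproduce $\bar{y}_{1.} - \bar{y}_{2.}$ for the estimable contrast.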


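13.3.2 An Estimator for $\sigma^2$


In assumption 2 for the one-way model in (13.1), we have $\operatorname{var}(\varepsilon_{ij}) = \sigma^2$ for all $i, j$. To estimate $\sigma^2$, we use (12.22)

$$
s^2 = \frac{\mathrm{SSE}}{k(n - 1)},
$$

where SSE is as given by (12.20) or (12.21):

$$
\mathrm{SSE} = \mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} = \mathbf{y}'[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}']\mathbf{y}.
$$

The rank of the idempotent matrix $\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'$ is $kn - k$ because $\operatorname{rank}(\mathbf{X}) = k$, $\operatorname{tr}(\mathbf{I}) = kn$, and $\operatorname{tr}[\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'] = k$ (see Theorem 2.13d). Then $s^2 = \mathrm{SSE}/k(n - 1)$ is an unbiased estimator of $\sigma^2$ [see Theorem 12.3e(i)].

Using $\hat{\boldsymbol{\beta}}$ from (13.12), we can express $\mathrm{SSE} = \mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y}$ in the following form:

$$
\mathrm{SSE} = \mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y}
= \sum_{i=1}^{k}\sum_{j=1}^{n} y_{ij}^2 - \sum_{i=1}^{k} \bar{y}_{i.} y_{i.}
= \sum_{ij} y_{ij}^2 - \sum_i \frac{y_{i.}^2}{n}. \tag{13.13}
$$

It can be shown (see Problem 13.4) that (13.13) can be written as

$$
\mathrm{SSE} = \sum_{ij} (y_{ij} - \bar{y}_{i.})^2. \tag{13.14}
$$

Thus $s^2$ is given by either of the two forms

$$
s^2 = \frac{\sum_{ij} (y_{ij} - \bar{y}_{i.})^2}{k(n - 1)} \tag{13.15}
$$

$$
\phantom{s^2} = \frac{\sum_{ij} y_{ij}^2 - \sum_i y_{i.}^2 / n}{k(n - 1)}. \tag{13.16}
$$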


13.4 TESTING THE HYPOTHESIS $H_0\colon \mu_1 = \mu_2 = \cdots = \mu_k$


Using the model in (13.2), the hypothesis of equality of means can be expressed as $H_0\colon \mu_1 = \mu_2 = \cdots = \mu_k$. The alternative hypothesis is that at least two means are unequal. Using $\mu_i = \mu + \alpha_i$ [see (13.1) and (13.2)], the hypothesis can be expressed as $H_0\colon \alpha_1 = \alpha_2 = \cdots = \alpha_k$, which is testable because it can be written in terms of $k - 1$ linearly independent estimable contrasts, for example, $H_0\colon \alpha_1 - \alpha_2 = \alpha_1 - \alpha_3 = \cdots = \alpha_1 - \alpha_k = 0$ (see the second paragraph in Section 12.7.1). In Section 13.4.1 we develop the test using the full–reduced-model approach, and in Section 13.4.2 we use the general linear hypothesis approach. In the model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$, the vector $\mathbf{y}$ is $kn \times 1$ [see (13.6)]. Throughout Section 13.4, we assume that $\mathbf{y}$ is $N_{kn}(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{I})$.


13.4.1 Full—Reduced-Model Approach 
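The hypothesis

$$
H_0\colon \alpha_1 = \alpha_2 = \cdots = \alpha_k \tag{13.17}
$$

is equivalent to

$$
H_0\colon \alpha_1^* = \alpha_2^* = \cdots = \alpha_k^*, \tag{13.18}
$$

where the $\alpha_i^*$ terms are subject to the side condition $\sum_i \alpha_i^* = 0$. With this constraint, $H_0$ in (13.18) is also equivalent to

$$
H_0\colon \alpha_1^* = \alpha_2^* = \cdots = \alpha_k^* = 0. \tag{13.19}
$$

The full model, $y_{ij} = \mu + \alpha_i + \varepsilon_{ij}$, $i = 1, 2, \ldots, k$, $j = 1, 2, \ldots, n$, is expressed in matrix form $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ in (13.6). If the full model is written in terms of $\mu^*$ and $\alpha_i^*$ as $y_{ij} = \mu^* + \alpha_i^* + \varepsilon_{ij}$, then the reduced model under $H_0$ in (13.19) is $y_{ij} = \mu^* + \varepsilon_{ij}$. In matrix form, this becomes $\mathbf{y} = \mu^*\mathbf{j} + \boldsymbol{\varepsilon}$, where $\mathbf{j}$ is $kn \times 1$. To be consistent with the full model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$, we write the reduced model as

$$
\mathbf{y} = \mu\mathbf{j} + \boldsymbol{\varepsilon}. \tag{13.20}
$$

For the full model, the sum of squares $\mathrm{SS}(\mu, \alpha) = \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y}$ is given as part of (13.13) as

$$
\mathrm{SS}(\mu, \alpha) = \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} = \sum_{i=1}^{k} \frac{y_{i.}^2}{n},
$$

where the sum of squares $\mathrm{SS}(\mu, \alpha_1, \ldots, \alpha_k)$ is abbreviated as $\mathrm{SS}(\mu, \alpha)$. For the reduced model in (13.20), the estimator "$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$" and the sum of squares "$\hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y}$" become

$$
\hat{\mu} = (\mathbf{j}'\mathbf{j})^{-1}\mathbf{j}'\mathbf{y} = \frac{y_{..}}{kn} = \bar{y}_{..}, \tag{13.21}
$$

$$
\mathrm{SS}(\mu) = \hat{\mu}\,\mathbf{j}'\mathbf{y} = \frac{y_{..}^2}{kn}, \tag{13.22}
$$

where $\mathbf{j}$ is $kn \times 1$.
From Table 12.3, the sum of squares for the $\alpha$'s adjusted for $\mu$ is given by

$$
\mathrm{SS}(\alpha \mid \mu) = \mathrm{SS}(\mu, \alpha) - \mathrm{SS}(\mu) = \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} - \frac{y_{..}^2}{kn}
= \sum_{i=1}^{k} \frac{y_{i.}^2}{n} - \frac{y_{..}^2}{kn} \tag{13.23}
$$

$$
\phantom{\mathrm{SS}(\alpha \mid \mu)} = n \sum_{i=1}^{k} (\bar{y}_{i.} - \bar{y}_{..})^2. \tag{13.24}
$$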


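TABLE 13.1 ANOVA for Testing $H_0\colon \alpha_1 = \alpha_2 = \cdots = \alpha_k$ in the One-Way Model

| Source of Variation | df | Sum of Squares | Mean Square | F Statistic |
|---|---|---|---|---|
| Treatments | $k - 1$ | $\mathrm{SS}(\alpha \mid \mu) = \dfrac{1}{n}\sum_i y_{i.}^2 - \dfrac{y_{..}^2}{kn}$ | $\dfrac{\mathrm{SS}(\alpha \mid \mu)}{k - 1}$ | $\dfrac{\mathrm{SS}(\alpha \mid \mu)/(k - 1)}{\mathrm{SSE}/k(n - 1)}$ |
| Error | $k(n - 1)$ | $\mathrm{SSE} = \sum_{ij} y_{ij}^2 - \dfrac{1}{n}\sum_i y_{i.}^2$ | $\dfrac{\mathrm{SSE}}{k(n - 1)}$ | |
| Total | $kn - 1$ | $\mathrm{SST} = \sum_{ij} y_{ij}^2 - \dfrac{y_{..}^2}{kn}$ | | |

The test is summarized in Table 13.1 using $\mathrm{SS}(\alpha \mid \mu)$ in (13.23) and SSE in (13.13). The chi-square and independence properties of $\mathrm{SS}(\alpha \mid \mu)$ and SSE follow from results established in Section 12.7.2.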

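To facilitate comparison of (13.23) with the result of the general linear hypothesis approach in Section 13.4.2, we now express $\mathrm{SS}(\alpha \mid \mu)$ as a quadratic form in $\mathbf{y}$. By (12.13), $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{y}$, and therefore $\hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} = \mathbf{y}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{y}$. Then with (13.21) and (13.22), we can write

$$
\begin{aligned}
\mathrm{SS}(\alpha \mid \mu) &= \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} - \frac{y_{..}^2}{kn} \\
&= \mathbf{y}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{y} - \mathbf{y}'\mathbf{j}_{kn}(\mathbf{j}_{kn}'\mathbf{j}_{kn})^{-1}\mathbf{j}_{kn}'\mathbf{y} \\
&= \mathbf{y}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{y} - \mathbf{y}'\left( \frac{\mathbf{j}_{kn}\mathbf{j}_{kn}'}{kn} \right)\mathbf{y} \\
&= \mathbf{y}'\left[ \mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}' - \frac{1}{kn}\mathbf{J}_{kn} \right]\mathbf{y}.
\end{aligned}
\tag{13.25}
$$

Using some results in the answer to Problem 13.3, this can be expressed as

$$
\mathrm{SS}(\alpha \mid \mu) = \mathbf{y}'\left[ \frac{1}{n}
\begin{pmatrix}
\mathbf{J} & \mathbf{O} & \cdots & \mathbf{O} \\
\mathbf{O} & \mathbf{J} & \cdots & \mathbf{O} \\
\vdots & \vdots & & \vdots \\
\mathbf{O} & \mathbf{O} & \cdots & \mathbf{J}
\end{pmatrix}
- \frac{1}{kn}
\begin{pmatrix}
\mathbf{J} & \mathbf{J} & \cdots & \mathbf{J} \\
\mathbf{J} & \mathbf{J} & \cdots & \mathbf{J} \\
\vdots & \vdots & & \vdots \\
\mathbf{J} & \mathbf{J} & \cdots & \mathbf{J}
\end{pmatrix}
\right]\mathbf{y} \tag{13.26}
$$

$$
\phantom{\mathrm{SS}(\alpha \mid \mu)} = \frac{1}{kn}\,\mathbf{y}'
\begin{pmatrix}
(k - 1)\mathbf{J} & -\mathbf{J} & \cdots & -\mathbf{J} \\
-\mathbf{J} & (k - 1)\mathbf{J} & \cdots & -\mathbf{J} \\
\vdots & \vdots & & \vdots \\
-\mathbf{J} & -\mathbf{J} & \cdots & (k - 1)\mathbf{J}
\end{pmatrix}
\mathbf{y}, \tag{13.27}
$$

where each $\mathbf{J}$ in (13.26) and (13.27) is $n \times n$.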


TABLE 13.2 Ascorbic Acid (mg/100g) for Three Packaging Methods

Method              A         B         C
                  14.29     20.06     20.04
                  19.10     20.64     26.23
                  19.09     18.00     22.74
                  16.25     19.56     24.04
                  15.09     19.47     23.37
                  16.61     19.07     25.02
                  19.63     18.38     23.27
Totals (y_i.)    120.06    135.18    164.71
Means (ȳ_i.)      17.15     19.31     23.53

Example 13.4. Three methods of packaging frozen foods were compared by Daniel (1974, p. 196). The response variable was ascorbic acid (mg/100g). The data are in Table 13.2.

To make the test comparing the means of the three methods, we calculate

$$
\frac{y_{..}^2}{kn} = \frac{(419.95)^2}{(3)(7)} = 8398.0001,
$$

$$
\frac{1}{7}\sum_{i=1}^{3} y_{i.}^2 = \frac{1}{7}\left[ (120.06)^2 + (135.18)^2 + (164.71)^2 \right] = \frac{1}{7}(59{,}817.4201) = 8545.3457,
$$

$$
\sum_{i=1}^{3}\sum_{j=1}^{7} y_{ij}^2 = 8600.3127.
$$

The sums of squares for treatments, error, and total are then

$$
\mathrm{SS}(\alpha \mid \mu) = \frac{1}{7}\sum_i y_{i.}^2 - \frac{y_{..}^2}{21} = 8545.3457 - 8398.0001 = 147.3456,
$$

$$
\mathrm{SSE} = \sum_{ij} y_{ij}^2 - \frac{1}{7}\sum_i y_{i.}^2 = 8600.3127 - 8545.3457 = 54.9670,
$$

$$
\mathrm{SST} = \sum_{ij} y_{ij}^2 - \frac{y_{..}^2}{21} = 8600.3127 - 8398.0001 = 202.3126.
$$

These sums of squares can be used to obtain an F test, as in Table 13.3. The p value for $F = 24.1256$ is $8.07 \times 10^{-6}$. Thus we reject $H_0\colon \mu_1 = \mu_2 = \mu_3$.


TABLE 13.3 ANOVA for the Ascorbic Acid Data in Table 13.2

Source     df    Sum of Squares    Mean Square    F
Method      2        147.3456        73.6728      24.1256
Error      18         54.9670         3.0537
Total      20        202.3126
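The arithmetic of Example 13.4 can be reproduced from the data in Table 13.2. The following is a sketch in NumPy rather than the authors' own computation; it uses the computing formulas of Table 13.1 and stops at the F statistic, since obtaining the p value would need an F distribution routine from an additional library.

```python
import numpy as np

# Ascorbic acid data from Table 13.2; rows are methods A, B, C.
y = np.array([
    [14.29, 19.10, 19.09, 16.25, 15.09, 16.61, 19.63],
    [20.06, 20.64, 18.00, 19.56, 19.47, 19.07, 18.38],
    [20.04, 26.23, 22.74, 24.04, 23.37, 25.02, 23.27],
])
k, n = y.shape

correction = y.sum() ** 2 / (k * n)          # y..^2 / kn
between = (y.sum(axis=1) ** 2).sum() / n     # sum_i y_i.^2 / n
raw_total = (y ** 2).sum()                   # sum_ij y_ij^2

ss_treat = between - correction              # SS(alpha | mu), (13.23)
sse = raw_total - between                    # SSE, (13.13)
sst = raw_total - correction                 # SST

ms_treat = ss_treat / (k - 1)
s2 = sse / (k * (n - 1))                     # unbiased estimator of sigma^2
f_stat = ms_treat / s2

print(f"SS(alpha|mu) = {ss_treat:.4f}, SSE = {sse:.4f}, SST = {sst:.4f}")
print(f"MS = {ms_treat:.4f}, s^2 = {s2:.4f}, F = {f_stat:.4f}")
```

The printed values should match the entries of Table 13.3 to the four decimal places shown there.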


13.4.2 General Linear Hypothesis 


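For simplicity of exposition, we illustrate all results in this section with $k = 4$. In this case, $\boldsymbol{\beta} = (\mu, \alpha_1, \alpha_2, \alpha_3, \alpha_4)'$, and the hypothesis is $H_0\colon \alpha_1 = \alpha_2 = \alpha_3 = \alpha_4$. Using three linearly independent estimable contrasts, the hypothesis can be written in the form

$$
H_0\colon \begin{pmatrix} \alpha_1 - \alpha_2 \\ \alpha_1 - \alpha_3 \\ \alpha_1 - \alpha_4 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix},
$$

that is, $H_0\colon \mathbf{C}\boldsymbol{\beta} = \mathbf{0}$, where

$$
\mathbf{C} = \begin{pmatrix}
0 & 1 & -1 & 0 & 0 \\
0 & 1 & 0 & -1 & 0 \\
0 & 1 & 0 & 0 & -1
\end{pmatrix}. \tag{13.28}
$$

The matrix $\mathbf{C}$ in (13.28) used to express $H_0\colon \alpha_1 = \alpha_2 = \alpha_3 = \alpha_4$ is not unique. Other contrasts could be used in $\mathbf{C}$, for example

$$
\mathbf{C}_1 = \begin{pmatrix}
0 & 1 & -1 & 0 & 0 \\
0 & 0 & 1 & -1 & 0 \\
0 & 0 & 0 & 1 & -1
\end{pmatrix}
\quad \text{or} \quad
\mathbf{C}_2 = \begin{pmatrix}
0 & 1 & 1 & -1 & -1 \\
0 & 1 & -1 & 0 & 0 \\
0 & 0 & 0 & 1 & -1
\end{pmatrix}.
$$

From (12.13) and Theorem 12.7b(iii), we have

$$
\begin{aligned}
\mathrm{SSH} &= (\mathbf{C}\hat{\boldsymbol{\beta}})'[\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{C}']^{-1}\mathbf{C}\hat{\boldsymbol{\beta}} \\
&= \mathbf{y}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{C}'[\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{C}']^{-1}\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{y}.
\end{aligned}
\tag{13.29}
$$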


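Using $\mathbf{C}$ in (13.28) and $(\mathbf{X}'\mathbf{X})^{-}$ in (13.11), we obtain

$$
\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{C}' = \frac{1}{n}\,\mathbf{C}\operatorname{diag}(0, 1, 1, 1, 1)\,\mathbf{C}'
= \frac{1}{n}\begin{pmatrix} 2 & 1 & 1 \\ 1 & 2 & 1 \\ 1 & 1 & 2 \end{pmatrix}. \tag{13.30}
$$

To find the inverse of (13.30), we write it in the form

$$
\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{C}' = \frac{1}{n}\left[ \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} + \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix} \right] = \frac{1}{n}(\mathbf{I}_3 + \mathbf{j}_3\mathbf{j}_3').
$$

Then by (2.53), the inverse is

$$
[\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{C}']^{-1} = n\left( \mathbf{I}_3 - \frac{\mathbf{I}_3^{-1}\mathbf{j}_3\mathbf{j}_3'\mathbf{I}_3^{-1}}{1 + \mathbf{j}_3'\mathbf{I}_3^{-1}\mathbf{j}_3} \right) = n\left( \mathbf{I}_3 - \tfrac{1}{4}\mathbf{J}_3 \right), \tag{13.31}
$$

where $\mathbf{J}_3$ is $3 \times 3$.
For $\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'$ in (13.29), we obtain

$$
\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}' = \frac{1}{n}\begin{pmatrix}
\mathbf{j}_n' & -\mathbf{j}_n' & \mathbf{0}' & \mathbf{0}' \\
\mathbf{j}_n' & \mathbf{0}' & -\mathbf{j}_n' & \mathbf{0}' \\
\mathbf{j}_n' & \mathbf{0}' & \mathbf{0}' & -\mathbf{j}_n'
\end{pmatrix} = \frac{1}{n}\mathbf{A}, \tag{13.32}
$$

where $\mathbf{j}_n'$ and $\mathbf{0}'$ are $1 \times n$.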
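Using (13.31) and (13.32), the matrix of the quadratic form for SSH in (13.29) can be expressed as

$$
\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{C}'[\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{C}']^{-1}\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'
= \frac{1}{n}\mathbf{A}'\,n\left( \mathbf{I}_3 - \tfrac{1}{4}\mathbf{J}_3 \right)\frac{1}{n}\mathbf{A}
= \frac{1}{n}\mathbf{A}'\mathbf{A} - \frac{1}{4n}\mathbf{A}'\mathbf{J}_3\mathbf{A}. \tag{13.33}
$$

The first term of (13.33) is given by

$$
\frac{1}{n}\mathbf{A}'\mathbf{A} = \frac{1}{n}
\begin{pmatrix}
\mathbf{j}_n & \mathbf{j}_n & \mathbf{j}_n \\
-\mathbf{j}_n & \mathbf{0} & \mathbf{0} \\
\mathbf{0} & -\mathbf{j}_n & \mathbf{0} \\
\mathbf{0} & \mathbf{0} & -\mathbf{j}_n
\end{pmatrix}
\begin{pmatrix}
\mathbf{j}_n' & -\mathbf{j}_n' & \mathbf{0}' & \mathbf{0}' \\
\mathbf{j}_n' & \mathbf{0}' & -\mathbf{j}_n' & \mathbf{0}' \\
\mathbf{j}_n' & \mathbf{0}' & \mathbf{0}' & -\mathbf{j}_n'
\end{pmatrix}
= \frac{1}{n}
\begin{pmatrix}
3\mathbf{J}_n & -\mathbf{J}_n & -\mathbf{J}_n & -\mathbf{J}_n \\
-\mathbf{J}_n & \mathbf{J}_n & \mathbf{O} & \mathbf{O} \\
-\mathbf{J}_n & \mathbf{O} & \mathbf{J}_n & \mathbf{O} \\
-\mathbf{J}_n & \mathbf{O} & \mathbf{O} & \mathbf{J}_n
\end{pmatrix}, \tag{13.34}
$$

since $\mathbf{j}_n\mathbf{j}_n' = \mathbf{J}_n$ and $\mathbf{j}_n\mathbf{0}' = \mathbf{O}$, where $\mathbf{O}$ is $n \times n$. The second term of (13.33) is given by

$$
\frac{1}{4n}\mathbf{A}'\mathbf{J}_3\mathbf{A} = \frac{1}{4n}
\begin{pmatrix}
9\mathbf{J}_n & -3\mathbf{J}_n & -3\mathbf{J}_n & -3\mathbf{J}_n \\
-3\mathbf{J}_n & \mathbf{J}_n & \mathbf{J}_n & \mathbf{J}_n \\
-3\mathbf{J}_n & \mathbf{J}_n & \mathbf{J}_n & \mathbf{J}_n \\
-3\mathbf{J}_n & \mathbf{J}_n & \mathbf{J}_n & \mathbf{J}_n
\end{pmatrix}. \tag{13.35}
$$

Then (13.33) becomes

$$
\begin{aligned}
\frac{1}{4n}(4\mathbf{A}'\mathbf{A}) - \frac{1}{4n}\mathbf{A}'\mathbf{J}_3\mathbf{A}
&= \frac{1}{4n}
\begin{pmatrix}
12\mathbf{J}_n & -4\mathbf{J}_n & -4\mathbf{J}_n & -4\mathbf{J}_n \\
-4\mathbf{J}_n & 4\mathbf{J}_n & \mathbf{O} & \mathbf{O} \\
-4\mathbf{J}_n & \mathbf{O} & 4\mathbf{J}_n & \mathbf{O} \\
-4\mathbf{J}_n & \mathbf{O} & \mathbf{O} & 4\mathbf{J}_n
\end{pmatrix}
- \frac{1}{4n}
\begin{pmatrix}
9\mathbf{J}_n & -3\mathbf{J}_n & -3\mathbf{J}_n & -3\mathbf{J}_n \\
-3\mathbf{J}_n & \mathbf{J}_n & \mathbf{J}_n & \mathbf{J}_n \\
-3\mathbf{J}_n & \mathbf{J}_n & \mathbf{J}_n & \mathbf{J}_n \\
-3\mathbf{J}_n & \mathbf{J}_n & \mathbf{J}_n & \mathbf{J}_n
\end{pmatrix} \\
&= \frac{1}{4n}
\begin{pmatrix}
3\mathbf{J}_n & -\mathbf{J}_n & -\mathbf{J}_n & -\mathbf{J}_n \\
-\mathbf{J}_n & 3\mathbf{J}_n & -\mathbf{J}_n & -\mathbf{J}_n \\
-\mathbf{J}_n & -\mathbf{J}_n & 3\mathbf{J}_n & -\mathbf{J}_n \\
-\mathbf{J}_n & -\mathbf{J}_n & -\mathbf{J}_n & 3\mathbf{J}_n
\end{pmatrix}.
\end{aligned}
\tag{13.36}
$$

Note that the matrix for SSH in (13.36) is the same as the matrix for $\mathrm{SS}(\alpha \mid \mu)$ in (13.27) with $k = 4$.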


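For completeness, we now express SSH in (13.29) in terms of the $y_{i.}$'s. We begin by writing (13.36) in the form

$$
\frac{1}{4n}
\begin{pmatrix}
4\mathbf{J}_n & \mathbf{O} & \mathbf{O} & \mathbf{O} \\
\mathbf{O} & 4\mathbf{J}_n & \mathbf{O} & \mathbf{O} \\
\mathbf{O} & \mathbf{O} & 4\mathbf{J}_n & \mathbf{O} \\
\mathbf{O} & \mathbf{O} & \mathbf{O} & 4\mathbf{J}_n
\end{pmatrix}
- \frac{1}{4n}
\begin{pmatrix}
\mathbf{J}_n & \mathbf{J}_n & \mathbf{J}_n & \mathbf{J}_n \\
\mathbf{J}_n & \mathbf{J}_n & \mathbf{J}_n & \mathbf{J}_n \\
\mathbf{J}_n & \mathbf{J}_n & \mathbf{J}_n & \mathbf{J}_n \\
\mathbf{J}_n & \mathbf{J}_n & \mathbf{J}_n & \mathbf{J}_n
\end{pmatrix}
= \frac{1}{n}
\begin{pmatrix}
\mathbf{J}_n & \mathbf{O} & \mathbf{O} & \mathbf{O} \\
\mathbf{O} & \mathbf{J}_n & \mathbf{O} & \mathbf{O} \\
\mathbf{O} & \mathbf{O} & \mathbf{J}_n & \mathbf{O} \\
\mathbf{O} & \mathbf{O} & \mathbf{O} & \mathbf{J}_n
\end{pmatrix}
- \frac{1}{4n}\mathbf{J}_{4n}.
$$

Using $\mathbf{y}' = (\mathbf{y}_1', \mathbf{y}_2', \mathbf{y}_3', \mathbf{y}_4')$ as defined in (13.6), SSH in (13.29) becomes

$$
\begin{aligned}
\mathrm{SSH} &= \mathbf{y}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{C}'[\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{C}']^{-1}\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{y} \\
&= \frac{1}{n}(\mathbf{y}_1', \mathbf{y}_2', \mathbf{y}_3', \mathbf{y}_4')
\begin{pmatrix}
\mathbf{J}_n & \mathbf{O} & \mathbf{O} & \mathbf{O} \\
\mathbf{O} & \mathbf{J}_n & \mathbf{O} & \mathbf{O} \\
\mathbf{O} & \mathbf{O} & \mathbf{J}_n & \mathbf{O} \\
\mathbf{O} & \mathbf{O} & \mathbf{O} & \mathbf{J}_n
\end{pmatrix}
\begin{pmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \\ \mathbf{y}_3 \\ \mathbf{y}_4 \end{pmatrix}
- \frac{1}{4n}\mathbf{y}'\mathbf{J}_{4n}\mathbf{y} \\
&= \frac{1}{n}\sum_{i=1}^{4} \mathbf{y}_i'\mathbf{J}_n\mathbf{y}_i - \frac{1}{4n}\mathbf{y}'\mathbf{J}_{4n}\mathbf{y} \\
&= \frac{1}{n}\sum_{i=1}^{4} \mathbf{y}_i'\mathbf{j}_n\mathbf{j}_n'\mathbf{y}_i - \frac{1}{4n}\mathbf{y}'\mathbf{j}_{4n}\mathbf{j}_{4n}'\mathbf{y} \\
&= \frac{1}{n}\sum_{i=1}^{4} y_{i.}^2 - \frac{1}{4n}y_{..}^2,
\end{aligned}
$$

which is the same as $\mathrm{SS}(\alpha \mid \mu)$ in (13.23).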
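The equivalence of SSH and $\mathrm{SS}(\alpha \mid \mu)$ for $k = 4$ can also be checked numerically. The sketch below uses simulated (hypothetical) data with $n = 3$, the generalized inverse in (13.11), and the matrices $\mathbf{C}$ and $\mathbf{C}_1$ given earlier; the helper name `ssh` is ours.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n = 4, 3
y = rng.normal(loc=20.0, scale=2.0, size=(k, n))   # simulated responses, for illustration only
yvec = y.ravel()

X = np.hstack([np.ones((k * n, 1)), np.kron(np.eye(k), np.ones((n, 1)))])
G = np.diag([0.0] + [1.0 / n] * k)                 # generalized inverse of X'X, as in (13.11)
beta_hat = G @ X.T @ yvec                          # (0, ybar_1., ..., ybar_4.)'


def ssh(C):
    """SSH = (C b)' [C G C']^{-1} (C b), as in (13.29)."""
    Cb = C @ beta_hat
    return float(Cb @ np.linalg.solve(C @ G @ C.T, Cb))


C = np.array([[0, 1, -1, 0, 0],
              [0, 1, 0, -1, 0],
              [0, 1, 0, 0, -1]], dtype=float)      # (13.28)
C1 = np.array([[0, 1, -1, 0, 0],
               [0, 0, 1, -1, 0],
               [0, 0, 0, 1, -1]], dtype=float)     # an alternative set of contrasts

ss_alpha = float((y.sum(axis=1) ** 2).sum() / n - y.sum() ** 2 / (k * n))  # (13.23)
inverse_13_31 = n * (np.eye(3) - np.ones((3, 3)) / 4)                      # n(I_3 - J_3/4)

print(ssh(C), ssh(C1), ss_alpha)
```

Both choices of contrast matrix give the same SSH, which agrees with the computing formula (13.23).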


13.5 EXPECTED MEAN SQUARES 


The expected mean squares for a one-way ANOVA are given in Table 13.4. The expected mean squares are defined as $E[\mathrm{SS}(\alpha \mid \mu)/(k - 1)]$ and $E[\mathrm{SSE}/k(n - 1)]$. The result is given in terms of parameters $\alpha_i^*$ such that $\sum_i \alpha_i^* = 0$.

TABLE 13.4 Expected Mean Squares for One-Way ANOVA

| Source of Variation | df | Sum of Squares | Mean Square | Expected Mean Square |
|---|---|---|---|---|
| Treatments | $k - 1$ | $\mathrm{SS}(\alpha \mid \mu)$ | $\dfrac{\mathrm{SS}(\alpha \mid \mu)}{k - 1}$ | $\sigma^2 + \dfrac{n}{k - 1}\sum_i \alpha_i^{*2}$ |
| Error | $k(n - 1)$ | SSE | $\dfrac{\mathrm{SSE}}{k(n - 1)}$ | $\sigma^2$ |
| Total | $kn - 1$ | SST | | |

If $H_0\colon \alpha_1^* = \alpha_2^* = \cdots = \alpha_k^* = 0$ is true, both of the expected mean squares are equal to $\sigma^2$, and we expect $F$ to be close to 1. On the other hand, if $H_0$ is false, $E[\mathrm{SS}(\alpha \mid \mu)/(k - 1)] > E[\mathrm{SSE}/k(n - 1)]$, and we expect $F$ to exceed 1. We therefore reject $H_0$ for large values of $F$.

The expected mean squares in Table 13.4 can be derived using the model $y_{ij} = \mu^* + \alpha_i^* + \varepsilon_{ij}$ in $E[\mathrm{SS}(\alpha \mid \mu)]$ and $E(\mathrm{SSE})$ (see Problem 13.11). In Sections 13.5.1 and 13.5.2, we obtain the expected mean squares using matrix methods similar to those in Sections 13.4.1 and 13.4.2.


13.5.1. Full—Reduced-Model Approach 


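For the error term in Table 13.4, we have

$$
E(\mathrm{SSE}) = E\{\mathbf{y}'[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}']\mathbf{y}\} = k(n - 1)\sigma^2, \tag{13.37}
$$

which was proved in Theorem 12.3e(i).

Using a full–reduced-model approach, the sum of squares for the $\alpha$'s adjusted for $\mu$ is given by (13.25) as $\mathrm{SS}(\alpha \mid \mu) = \mathbf{y}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{y} - \mathbf{y}'[(1/kn)\mathbf{J}_{kn}]\mathbf{y}$. Thus

$$
E[\mathrm{SS}(\alpha \mid \mu)] = E[\mathbf{y}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{y}] - E\left[ \mathbf{y}'\left( \frac{1}{kn}\mathbf{J}_{kn} \right)\mathbf{y} \right]. \tag{13.38}
$$

Using Theorem 5.2a, the first term on the right side of (13.38) becomes

$$
\begin{aligned}
E[\mathbf{y}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{y}]
&= \operatorname{tr}[\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\sigma^2\mathbf{I}] + (\mathbf{X}\boldsymbol{\beta})'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'(\mathbf{X}\boldsymbol{\beta}) \\
&= \sigma^2\operatorname{tr}[\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'] + \boldsymbol{\beta}'\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{X}\boldsymbol{\beta} \\
&= \sigma^2\operatorname{tr}[\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'] + \boldsymbol{\beta}'\mathbf{X}'\mathbf{X}\boldsymbol{\beta} \quad [\text{by (2.58)}].
\end{aligned}
\tag{13.39}
$$

By Theorem 2.13f, the matrix $\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'$ is idempotent. Hence, by Theorems 2.13d and 2.8c(v), we obtain

$$
\operatorname{tr}[\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'] = \operatorname{rank}[\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'] = \operatorname{rank}(\mathbf{X}) = k. \tag{13.40}
$$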


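To evaluate the second term on the right side of (13.39), we use $\mathbf{X}'\mathbf{X}$ in (13.7) and use $\boldsymbol{\beta}' = (\mu^*, \alpha_1^*, \ldots, \alpha_k^*)$ subject to $\sum_i \alpha_i^* = 0$. Then

$$
\begin{aligned}
\boldsymbol{\beta}'\mathbf{X}'\mathbf{X}\boldsymbol{\beta}
&= n(\mu^*, \alpha_1^*, \ldots, \alpha_k^*)
\begin{pmatrix}
k & 1 & 1 & \cdots & 1 \\
1 & 1 & 0 & \cdots & 0 \\
1 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots \\
1 & 0 & 0 & \cdots & 1
\end{pmatrix}
\begin{pmatrix} \mu^* \\ \alpha_1^* \\ \vdots \\ \alpha_k^* \end{pmatrix} \\
&= n\left( k\mu^* + \sum_i \alpha_i^*,\; \mu^* + \alpha_1^*,\; \ldots,\; \mu^* + \alpha_k^* \right)
\begin{pmatrix} \mu^* \\ \alpha_1^* \\ \vdots \\ \alpha_k^* \end{pmatrix} \\
&= n\left[ k\mu^{*2} + \sum_i (\mu^* + \alpha_i^*)\alpha_i^* \right] \\
&= n\left( k\mu^{*2} + \mu^*\sum_i \alpha_i^* + \sum_i \alpha_i^{*2} \right) \\
&= kn\mu^{*2} + n\sum_i \alpha_i^{*2}.
\end{aligned}
\tag{13.41}
$$

Hence, using (13.40) and (13.41), $E[\mathbf{y}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{y}]$ in (13.39) becomes

$$
E[\mathbf{y}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{y}] = k\sigma^2 + kn\mu^{*2} + n\sum_i \alpha_i^{*2}. \tag{13.42}
$$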


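For the second term on the right side of (13.38), we obtain

$$
\begin{aligned}
E\left[ \mathbf{y}'\left( \frac{1}{kn}\mathbf{J}_{kn} \right)\mathbf{y} \right]
&= \sigma^2\operatorname{tr}\left( \frac{1}{kn}\mathbf{J}_{kn} \right) + \boldsymbol{\beta}'\mathbf{X}'\left( \frac{1}{kn}\mathbf{J}_{kn} \right)\mathbf{X}\boldsymbol{\beta} \\
&= \frac{\sigma^2 kn}{kn} + \frac{1}{kn}\boldsymbol{\beta}'\mathbf{X}'\mathbf{j}_{kn}\mathbf{j}_{kn}'\mathbf{X}\boldsymbol{\beta} \\
&= \sigma^2 + \frac{1}{kn}(\boldsymbol{\beta}'\mathbf{X}'\mathbf{j}_{kn})(\mathbf{j}_{kn}'\mathbf{X}\boldsymbol{\beta}).
\end{aligned}
\tag{13.43}
$$

Using $\mathbf{X}$ as given in (13.6), $\mathbf{j}_{kn}'\mathbf{X}\boldsymbol{\beta}$ becomes

$$
\begin{aligned}
\mathbf{j}_{kn}'\mathbf{X}\boldsymbol{\beta}
&= (\mathbf{j}_n', \mathbf{j}_n', \ldots, \mathbf{j}_n')
\begin{pmatrix}
\mathbf{j}_n & \mathbf{j}_n & \mathbf{0} & \cdots & \mathbf{0} \\
\mathbf{j}_n & \mathbf{0} & \mathbf{j}_n & \cdots & \mathbf{0} \\
\vdots & \vdots & \vdots & & \vdots \\
\mathbf{j}_n & \mathbf{0} & \mathbf{0} & \cdots & \mathbf{j}_n
\end{pmatrix}
\begin{pmatrix} \mu^* \\ \alpha_1^* \\ \vdots \\ \alpha_k^* \end{pmatrix} \\
&= (kn, n, n, \ldots, n)
\begin{pmatrix} \mu^* \\ \alpha_1^* \\ \vdots \\ \alpha_k^* \end{pmatrix}
\quad (\text{since } \mathbf{j}_n'\mathbf{j}_n = n) \\
&= kn\mu^* + n\sum_{i=1}^{k} \alpha_i^* = kn\mu^* \quad \left( \text{since } \sum_i \alpha_i^* = 0 \right).
\end{aligned}
$$

The second term on the right side of (13.43) is then given by

$$
\frac{1}{kn}(\boldsymbol{\beta}'\mathbf{X}'\mathbf{j}_{kn})(\mathbf{j}_{kn}'\mathbf{X}\boldsymbol{\beta}) = \frac{1}{kn}(\mathbf{j}_{kn}'\mathbf{X}\boldsymbol{\beta})^2 = \frac{k^2n^2\mu^{*2}}{kn} = kn\mu^{*2},
$$

so that (13.43) becomes

$$
E\left[ \mathbf{y}'\left( \frac{1}{kn}\mathbf{J}_{kn} \right)\mathbf{y} \right] = \sigma^2 + kn\mu^{*2}. \tag{13.44}
$$

Now, using (13.42) and (13.44), $E[\mathrm{SS}(\alpha \mid \mu)]$ in (13.38) becomes

$$
\begin{aligned}
E[\mathrm{SS}(\alpha \mid \mu)] &= k\sigma^2 + kn\mu^{*2} + n\sum_{i=1}^{k} \alpha_i^{*2} - (\sigma^2 + kn\mu^{*2}) \\
&= (k - 1)\sigma^2 + n\sum_{i=1}^{k} \alpha_i^{*2}.
\end{aligned}
\tag{13.45}
$$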
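The result in (13.45) and the error expectation in (13.37) can be verified without simulation by evaluating the formula of Theorem 5.2a, $E(\mathbf{y}'\mathbf{A}\mathbf{y}) = \operatorname{tr}(\mathbf{A}\boldsymbol{\Sigma}) + \boldsymbol{\mu}'\mathbf{A}\boldsymbol{\mu}$, with $\boldsymbol{\Sigma} = \sigma^2\mathbf{I}$ and $\boldsymbol{\mu} = \mathbf{X}\boldsymbol{\beta}$. The parameter values below are hypothetical, chosen only so that $\sum_i \alpha_i^* = 0$, and the Moore-Penrose inverse is used as one convenient generalized inverse.

```python
import numpy as np

k, n, sigma2 = 4, 3, 2.5                       # hypothetical design size and error variance
mu_star = 10.0
alpha_star = np.array([-2.0, -1.0, 1.0, 2.0])  # hypothetical effects, summing to zero

X = np.hstack([np.ones((k * n, 1)), np.kron(np.eye(k), np.ones((n, 1)))])
beta = np.concatenate([[mu_star], alpha_star])
mean = X @ beta                                # E(y) = X beta

P = X @ np.linalg.pinv(X.T @ X) @ X.T          # X (X'X)^- X', invariant to the g-inverse used
J_bar = np.ones((k * n, k * n)) / (k * n)      # (1/kn) J_kn


def expected_quadratic_form(A):
    """E(y'Ay) = sigma^2 tr(A) + mean' A mean when cov(y) = sigma^2 I."""
    return float(sigma2 * np.trace(A) + mean @ A @ mean)


e_ss_treat = expected_quadratic_form(P - J_bar)           # E[SS(alpha | mu)]
e_sse = expected_quadratic_form(np.eye(k * n) - P)        # E(SSE)

formula_treat = (k - 1) * sigma2 + n * float((alpha_star ** 2).sum())  # (13.45)
formula_error = k * (n - 1) * sigma2                                   # (13.37)
print(e_ss_treat, formula_treat, e_sse, formula_error)
```

With these values, both routes give 37.5 for the treatment sum of squares and 20 for the error sum of squares.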


13.5.2 General Linear Hypothesis 


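To simplify exposition, we use $k = 4$ to illustrate results in this section, as was done in Section 13.4.2. It was shown in Section 13.4.2 that $\mathrm{SSH} = (\mathbf{C}\hat{\boldsymbol{\beta}})'[\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{C}']^{-1}\mathbf{C}\hat{\boldsymbol{\beta}}$ is the same as $\mathrm{SS}(\alpha \mid \mu) = \sum_i y_{i.}^2/n - y_{..}^2/kn$ in (13.23). Note that for $k = 4$, $\mathbf{C}$ is $3 \times 5$ [see (13.28)] and $\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{C}'$ is $3 \times 3$ [see (13.30)]. To obtain $E[\mathrm{SS}(\alpha \mid \mu)]$, we first note that by (12.44), (12.45), and (13.31), $E(\mathbf{C}\hat{\boldsymbol{\beta}}) = \mathbf{C}\boldsymbol{\beta}$, $\operatorname{cov}(\mathbf{C}\hat{\boldsymbol{\beta}}) = \sigma^2\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{C}'$, and $[\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{C}']^{-1} = n(\mathbf{I}_3 - \tfrac{1}{4}\mathbf{J}_3)$.

Then, by Theorem 5.2a, we have

$$
\begin{aligned}
E[\mathrm{SS}(\alpha \mid \mu)] &= E\{(\mathbf{C}\hat{\boldsymbol{\beta}})'[\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{C}']^{-1}\mathbf{C}\hat{\boldsymbol{\beta}}\} \\
&= \operatorname{tr}\{[\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{C}']^{-1}\operatorname{cov}(\mathbf{C}\hat{\boldsymbol{\beta}})\} + [E(\mathbf{C}\hat{\boldsymbol{\beta}})]'[\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{C}']^{-1}E(\mathbf{C}\hat{\boldsymbol{\beta}}) \\
&= \operatorname{tr}\{[\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{C}']^{-1}\sigma^2\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{C}'\} + n(\mathbf{C}\boldsymbol{\beta})'\left[ \mathbf{I}_3 - \tfrac{1}{4}\mathbf{J}_3 \right]\mathbf{C}\boldsymbol{\beta} \\
&= \sigma^2\operatorname{tr}(\mathbf{I}_3) + n\boldsymbol{\beta}'\mathbf{C}'\left( \mathbf{I}_3 - \tfrac{1}{4}\mathbf{J}_3 \right)\mathbf{C}\boldsymbol{\beta} \\
&= 3\sigma^2 + n\boldsymbol{\beta}'\left( \mathbf{C}'\mathbf{C} - \tfrac{1}{4}\mathbf{C}'\mathbf{J}_3\mathbf{C} \right)\boldsymbol{\beta}.
\end{aligned}
\tag{13.46}
$$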


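Using $\mathbf{C}$ in (13.28), we obtain

$$
\mathbf{C}'\mathbf{C} = \begin{pmatrix}
0 & 0 & 0 & 0 & 0 \\
0 & 3 & -1 & -1 & -1 \\
0 & -1 & 1 & 0 & 0 \\
0 & -1 & 0 & 1 & 0 \\
0 & -1 & 0 & 0 & 1
\end{pmatrix}, \tag{13.47}
$$

$$
\mathbf{C}'\mathbf{J}_3\mathbf{C} = \begin{pmatrix}
0 & 0 & 0 & 0 & 0 \\
0 & 9 & -3 & -3 & -3 \\
0 & -3 & 1 & 1 & 1 \\
0 & -3 & 1 & 1 & 1 \\
0 & -3 & 1 & 1 & 1
\end{pmatrix}. \tag{13.48}
$$

From (13.47) and (13.48), we have

$$
\begin{aligned}
\mathbf{C}'\mathbf{C} - \tfrac{1}{4}\mathbf{C}'\mathbf{J}_3\mathbf{C} &= \tfrac{1}{4}(4\mathbf{C}'\mathbf{C} - \mathbf{C}'\mathbf{J}_3\mathbf{C}) \\
&= \frac{1}{4}\begin{pmatrix}
0 & 0 & 0 & 0 & 0 \\
0 & 3 & -1 & -1 & -1 \\
0 & -1 & 3 & -1 & -1 \\
0 & -1 & -1 & 3 & -1 \\
0 & -1 & -1 & -1 & 3
\end{pmatrix} \\
&= \frac{1}{4}\begin{pmatrix}
0 & 0 & 0 & 0 & 0 \\
0 & 4 & 0 & 0 & 0 \\
0 & 0 & 4 & 0 & 0 \\
0 & 0 & 0 & 4 & 0 \\
0 & 0 & 0 & 0 & 4
\end{pmatrix}
- \frac{1}{4}\begin{pmatrix}
0 & 0 & 0 & 0 & 0 \\
0 & 1 & 1 & 1 & 1 \\
0 & 1 & 1 & 1 & 1 \\
0 & 1 & 1 & 1 & 1 \\
0 & 1 & 1 & 1 & 1
\end{pmatrix} \\
&= \begin{pmatrix} 0 & \mathbf{0}' \\ \mathbf{0} & \mathbf{I}_4 \end{pmatrix} - \frac{1}{4}\begin{pmatrix} 0 & \mathbf{0}' \\ \mathbf{0} & \mathbf{J}_4 \end{pmatrix}.
\end{aligned}
$$

Thus the second term on the right side of (13.46) is given by

$$
\begin{aligned}
n\boldsymbol{\beta}'\left( \mathbf{C}'\mathbf{C} - \tfrac{1}{4}\mathbf{C}'\mathbf{J}_3\mathbf{C} \right)\boldsymbol{\beta}
&= n(\mu^*, \alpha_1^*, \alpha_2^*, \alpha_3^*, \alpha_4^*)\begin{pmatrix} 0 & \mathbf{0}' \\ \mathbf{0} & \mathbf{I}_4 \end{pmatrix}\begin{pmatrix} \mu^* \\ \alpha_1^* \\ \vdots \\ \alpha_4^* \end{pmatrix}
- \frac{1}{4}n(\mu^*, \alpha_1^*, \alpha_2^*, \alpha_3^*, \alpha_4^*)\begin{pmatrix} 0 & \mathbf{0}' \\ \mathbf{0} & \mathbf{J}_4 \end{pmatrix}\begin{pmatrix} \mu^* \\ \alpha_1^* \\ \vdots \\ \alpha_4^* \end{pmatrix} \\
&= n\sum_{i=1}^{4} \alpha_i^{*2} - \frac{1}{4}n\left( 0, \sum_i \alpha_i^*, \sum_i \alpha_i^*, \sum_i \alpha_i^*, \sum_i \alpha_i^* \right)\begin{pmatrix} \mu^* \\ \alpha_1^* \\ \vdots \\ \alpha_4^* \end{pmatrix} \\
&= n\sum_{i=1}^{4} \alpha_i^{*2},
\end{aligned}
$$

since $\sum_i \alpha_i^* = 0$. Hence, (13.46) becomes

$$
E[\mathrm{SS}(\alpha \mid \mu)] = 3\sigma^2 + n\sum_{i=1}^{4} \alpha_i^{*2}. \tag{13.49}
$$

This result is for the special case $k = 4$. For a general $k$, (13.49) becomes

$$
E[\mathrm{SS}(\alpha \mid \mu)] = (k - 1)\sigma^2 + n\sum_{i=1}^{k} \alpha_i^{*2}.
$$

For the case in which $\boldsymbol{\beta}' = (\mu, \alpha_1, \ldots, \alpha_k)$ is not subject to $\sum_i \alpha_i = 0$, see Problem 13.14.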


13.6 CONTRASTS


We noted in Section 13.2 that a linear combination $\sum_{i=1}^{k} c_i\alpha_i$ in the $\alpha$'s is estimable if and only if $\sum_{i=1}^{k} c_i = 0$. In Section 13.6.1, we develop a test of significance for such contrasts. In Section 13.6.2, we show that if the contrasts are formulated appropriately, the sum of squares for treatments can be partitioned into $k - 1$ independent sums of squares for contrasts. In Section 13.6.3, we develop orthogonal polynomial contrasts for the special case in which the treatments have equally spaced quantitative levels.


13.6.1 Hypothesis Test for a Contrast 


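For the one-way model, a contrast $\sum_i c_i\alpha_i$, where $\sum_i c_i = 0$, is equivalent to $\sum_i c_i\mu_i$ since

$$
\sum_i c_i\mu_i = \sum_i c_i(\mu + \alpha_i) = \mu\sum_i c_i + \sum_i c_i\alpha_i = \sum_i c_i\alpha_i.
$$

A hypothesis of interest is

$$
H_0\colon \sum_i c_i\alpha_i = 0 \quad \text{or} \quad H_0\colon \sum_i c_i\mu_i = 0, \tag{13.50}
$$

which represents a comparison of means if $\sum_i c_i = 0$. For example, the hypothesis

$$
H_0\colon 3\mu_1 - \mu_2 - \mu_3 - \mu_4 = 0
$$

can be written as

$$
H_0\colon \mu_1 = \tfrac{1}{3}(\mu_2 + \mu_3 + \mu_4),
$$

which compares $\mu_1$ with the average of $\mu_2$, $\mu_3$, and $\mu_4$.

The hypothesis in (13.50) can be expressed as $H_0\colon \mathbf{c}'\boldsymbol{\beta} = 0$, where $\mathbf{c}' = (0, c_1, c_2, \ldots, c_k)$ and $\boldsymbol{\beta} = (\mu, \alpha_1, \ldots, \alpha_k)'$. Assuming that $\mathbf{y}$ is $N_{kn}(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{I})$, $H_0$ can be tested using Theorem 12.7c. In this case, we have $m = 1$, and the test statistic becomes

$$
F = \frac{(\mathbf{c}'\hat{\boldsymbol{\beta}})'[\mathbf{c}'(\mathbf{X}'\mathbf{X})^{-}\mathbf{c}]^{-1}\mathbf{c}'\hat{\boldsymbol{\beta}}}{\mathrm{SSE}/k(n - 1)}
= \frac{(\mathbf{c}'\hat{\boldsymbol{\beta}})^2}{s^2\,\mathbf{c}'(\mathbf{X}'\mathbf{X})^{-}\mathbf{c}} \tag{13.51}
$$

$$
\phantom{F} = \frac{\left( \sum_{i=1}^{k} c_i\bar{y}_{i.} \right)^2}{s^2\sum_{i=1}^{k} c_i^2/n}, \tag{13.52}
$$

where $s^2 = \mathrm{SSE}/k(n - 1)$, and $(\mathbf{X}'\mathbf{X})^{-}$ and $\hat{\boldsymbol{\beta}}$ are as given by (13.11) and (13.12). The sum of squares for the contrast is $(\mathbf{c}'\hat{\boldsymbol{\beta}})^2/\mathbf{c}'(\mathbf{X}'\mathbf{X})^{-}\mathbf{c}$ or $n\left( \sum_i c_i\bar{y}_{i.} \right)^2 / \sum_i c_i^2$.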

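13.6.2 Orthogonal Contrasts

Two contrasts $\mathbf{c}_i'\hat{\boldsymbol{\beta}}$ and $\mathbf{c}_j'\hat{\boldsymbol{\beta}}$ are said to be orthogonal if $\mathbf{c}_i'\mathbf{c}_j = 0$. We now show that if $\mathbf{c}_i'\hat{\boldsymbol{\beta}}$ and $\mathbf{c}_j'\hat{\boldsymbol{\beta}}$ are orthogonal, they are independent. Since we are assuming normality, $\mathbf{c}_i'\hat{\boldsymbol{\beta}}$ and $\mathbf{c}_j'\hat{\boldsymbol{\beta}}$ are independent if

$$
\operatorname{cov}(\mathbf{c}_i'\hat{\boldsymbol{\beta}}, \mathbf{c}_j'\hat{\boldsymbol{\beta}}) = 0 \tag{13.53}
$$

(see Problem 13.16). By Theorem 12.3c, $\operatorname{cov}(\mathbf{c}_i'\hat{\boldsymbol{\beta}}, \mathbf{c}_j'\hat{\boldsymbol{\beta}}) = \sigma^2\mathbf{c}_i'(\mathbf{X}'\mathbf{X})^{-}\mathbf{c}_j$. By (13.11), $(\mathbf{X}'\mathbf{X})^{-} = \operatorname{diag}[0, (1/n), \ldots, (1/n)]$, and therefore

$$
\operatorname{cov}(\mathbf{c}_i'\hat{\boldsymbol{\beta}}, \mathbf{c}_j'\hat{\boldsymbol{\beta}}) = \sigma^2\mathbf{c}_i'(\mathbf{X}'\mathbf{X})^{-}\mathbf{c}_j = 0 \quad \text{if } \mathbf{c}_i'\mathbf{c}_j = 0 \tag{13.54}
$$

(assuming that the first element of $\mathbf{c}_i$ is 0 for all $i$). By an argument similar to that used in the proofs of Corollary 1 to Theorem 5.6b and in Theorem 12.7b(v), the sums of squares $(\mathbf{c}_i'\hat{\boldsymbol{\beta}})^2/\mathbf{c}_i'(\mathbf{X}'\mathbf{X})^{-}\mathbf{c}_i$ and $(\mathbf{c}_j'\hat{\boldsymbol{\beta}})^2/\mathbf{c}_j'(\mathbf{X}'\mathbf{X})^{-}\mathbf{c}_j$ are also independent. Thus, if two contrasts are orthogonal, they are independent and their corresponding sums of squares are independent.

We now show that if the rows of $\mathbf{C}$ (Section 13.4.2) are mutually orthogonal contrasts, SSH is the sum of $(\mathbf{c}_i'\hat{\boldsymbol{\beta}})^2/\mathbf{c}_i'(\mathbf{X}'\mathbf{X})^{-}\mathbf{c}_i$ for all rows of $\mathbf{C}$.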


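Theorem 13.6a. In the balanced one-way model, if $\mathbf{y}$ is $N_{kn}(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{I})$ and if $H_0\colon \alpha_1 = \alpha_2 = \cdots = \alpha_k$ is expressed as $\mathbf{C}\boldsymbol{\beta} = \mathbf{0}$, where the rows of

$$
\mathbf{C} = \begin{pmatrix} \mathbf{c}_1' \\ \mathbf{c}_2' \\ \vdots \\ \mathbf{c}_{k-1}' \end{pmatrix}
$$

are mutually orthogonal contrasts, then $\mathrm{SSH} = (\mathbf{C}\hat{\boldsymbol{\beta}})'[\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{C}']^{-1}\mathbf{C}\hat{\boldsymbol{\beta}}$ can be expressed (partitioned) as

$$
\mathrm{SSH} = \sum_{i=1}^{k-1} \frac{(\mathbf{c}_i'\hat{\boldsymbol{\beta}})^2}{\mathbf{c}_i'(\mathbf{X}'\mathbf{X})^{-}\mathbf{c}_i}, \tag{13.55}
$$

where the sums of squares $(\mathbf{c}_i'\hat{\boldsymbol{\beta}})^2/\mathbf{c}_i'(\mathbf{X}'\mathbf{X})^{-}\mathbf{c}_i$, $i = 1, 2, \ldots, k - 1$, are independent.


PROOF. By (13.54), $\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{C}'$ is a diagonal matrix with $\mathbf{c}_i'(\mathbf{X}'\mathbf{X})^{-}\mathbf{c}_i$, $i = 1, 2, \ldots, k - 1$, on the diagonal. Thus, with $(\mathbf{C}\hat{\boldsymbol{\beta}})' = (\mathbf{c}_1'\hat{\boldsymbol{\beta}}, \mathbf{c}_2'\hat{\boldsymbol{\beta}}, \ldots, \mathbf{c}_{k-1}'\hat{\boldsymbol{\beta}})$, (13.55) follows. Since the rows $\mathbf{c}_1', \mathbf{c}_2', \ldots, \mathbf{c}_{k-1}'$ of $\mathbf{C}$ are orthogonal, the independence of the sums of squares for the contrasts follows from (13.53) and (13.54).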


13.6 CONTRASTS 359 


An interesting implication of Theorem 13.6a is that the overall F for treatments 
(Table 13.1) is the average of the F statistics for each of the orthogonal contrasts: 


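\[ F = \frac{\mathrm{SSH}/(k-1)}{s^2} = \frac{1}{k-1}\sum_{i=1}^{k-1} \frac{(\mathbf{c}_i'\hat{\boldsymbol{\beta}})^2}{s^2\,\mathbf{c}_i'(\mathbf{X}'\mathbf{X})^{-}\mathbf{c}_i} = \frac{1}{k-1}\sum_{i=1}^{k-1} F_i. \]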


It is possible that the overall F would lead to rejection of the overall Hy while some of 
the F;’s for individual contrasts would not lead to rejection of the corresponding Ho’s. 
Likewise, since one or more of the F;’s will be larger than the overall F, it is possible 
that an individual Hp would be rejected, while the overall Ho is not rejected. 


Example 13.6a. We illustrate the use of orthogonal contrasts with the ascorbic acid data of Table 13.2. Consider the orthogonal contrasts $2\mu_1 - \mu_2 - \mu_3$ and $\mu_2 - \mu_3$. By (13.50), these can be expressed as

\[ 2\mu_1 - \mu_2 - \mu_3 = 2\alpha_1 - \alpha_2 - \alpha_3 = (0, 2, -1, -1)\boldsymbol{\beta} = \mathbf{c}_1'\boldsymbol{\beta}, \]
\[ \mu_2 - \mu_3 = \alpha_2 - \alpha_3 = (0, 0, 1, -1)\boldsymbol{\beta} = \mathbf{c}_2'\boldsymbol{\beta}. \]

The hypotheses $H_{01} : \mathbf{c}_1'\boldsymbol{\beta} = 0$ and $H_{02} : \mathbf{c}_2'\boldsymbol{\beta} = 0$ compare the first treatment versus the other two and the second treatment versus the third.

The means are given in Table 13.2 as $\bar{y}_{1.} = 17.15$, $\bar{y}_{2.} = 19.31$, and $\bar{y}_{3.} = 23.53$. Then by (13.52), the sums of squares for the two contrasts are

\[ \mathrm{SS}_1 = \frac{n\left(\sum_{i=1}^{3} c_i\bar{y}_{i.}\right)^2}{\sum_{i=1}^{3} c_i^2} = \frac{7[2(17.15) - 19.31 - 23.53]^2}{4 + 1 + 1} = 85.0584, \]
\[ \mathrm{SS}_2 = \frac{7(19.31 - 23.53)^2}{1 + 1} = 62.2872. \]

By (13.52), the corresponding $F$ statistics are

\[ F_1 = \frac{\mathrm{SS}_1}{s^2} = \frac{85.0584}{3.0537} = 27.85, \qquad F_2 = \frac{\mathrm{SS}_2}{s^2} = \frac{62.2872}{3.0537} = 20.40, \]

where $s^2 = 3.0537$ is from Table 13.3. Both $F_1$ and $F_2$ exceed $F_{.05,1,18} = 4.41$. The $p$ values are .000051 and .000267, respectively.

Note that the sums of squares for the two orthogonal contrasts add to the sum of squares for treatments given in Example 13.4; that is, 147.3456 = 85.0584 + 62.2872, as in (13.55).
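
The contrast computations in Example 13.6a can be sketched numerically. The following is a minimal illustration, not part of the original example: it assumes $n = 7$ (read off the worked formula) and uses the rounded means quoted above, so the results agree with the printed sums of squares only to about one decimal place. The helper name `contrast_ss` is hypothetical.

```python
import numpy as np

# Values quoted in Example 13.6a; n = 7 is read off the worked formula.
n = 7
ybar = np.array([17.15, 19.31, 23.53])   # treatment means (rounded)
s2 = 3.0537                              # error mean square from Table 13.3

def contrast_ss(c, ybar, n):
    """Sum of squares n * (sum c_i ybar_i)^2 / sum c_i^2 for a contrast c."""
    c = np.asarray(c, dtype=float)
    return n * (c @ ybar) ** 2 / (c @ c)

c1 = np.array([2.0, -1.0, -1.0])   # treatment 1 versus treatments 2 and 3
c2 = np.array([0.0, 1.0, -1.0])    # treatment 2 versus treatment 3

ss1 = contrast_ss(c1, ybar, n)
ss2 = contrast_ss(c2, ybar, n)
f1, f2 = ss1 / s2, ss2 / s2        # F statistics as in (13.52)

print(ss1, ss2, f1, f2)
```

Because the means are rounded to two decimals, small differences from the printed values are expected; the check `c1 @ c2 == 0` confirms that the two contrasts are orthogonal.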


The partitioning of the treatment sum of squares in Theorem 13.6a is always poss- 
ible. First note that SSH = y’Ay as in (13.29), where A is idempotent. We now show 
that any such quadratic form can be partitioned into independent components. 


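Theorem 13.6b. Let $\mathbf{y}'\mathbf{A}\mathbf{y}$ be a quadratic form, let $\mathbf{A}$ be symmetric and idempotent of rank $r$, let $N = kn$, and let the $N \times 1$ random vector $\mathbf{y}$ be $N_N(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{I})$. Then there exist $r$ idempotent matrices $\mathbf{A}_1, \mathbf{A}_2, \ldots, \mathbf{A}_r$ such that $\mathbf{A} = \sum_{i=1}^{r}\mathbf{A}_i$, $\mathrm{rank}(\mathbf{A}_i) = 1$ for $i = 1, 2, \ldots, r$, and $\mathbf{A}_i\mathbf{A}_j = \mathbf{O}$ for $i \neq j$. Furthermore, $\mathbf{y}'\mathbf{A}\mathbf{y}$ can be partitioned as

\[ \mathbf{y}'\mathbf{A}\mathbf{y} = \sum_{i=1}^{r}\mathbf{y}'\mathbf{A}_i\mathbf{y}, \tag{13.56} \]

where each $\mathbf{y}'\mathbf{A}_i\mathbf{y}$ in (13.56) is $\chi^2(1, \lambda_i)$ and $\mathbf{y}'\mathbf{A}_i\mathbf{y}$ and $\mathbf{y}'\mathbf{A}_j\mathbf{y}$ are independent for $i \neq j$ (note that $\lambda_i$ is a noncentrality parameter).

PROOF. Since $\mathbf{A}$ is $N \times N$ of rank $r$ and is symmetric and idempotent, then by Theorem 2.13c, $r$ of its eigenvalues are equal to 1 and the others are 0. Using the spectral decomposition (2.104), we can express $\mathbf{A}$ in the form

\[ \mathbf{A} = \sum_{i=1}^{r}\mathbf{v}_i\mathbf{v}_i' = \sum_{i=1}^{r}\mathbf{A}_i, \tag{13.57} \]

where $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_r$ are normalized orthogonal eigenvectors corresponding to the nonzero eigenvalues and $\mathbf{A}_i = \mathbf{v}_i\mathbf{v}_i'$. It is easily shown that $\mathrm{rank}(\mathbf{A}_i) = 1$, $\mathbf{A}_i\mathbf{A}_j = \mathbf{O}$ for $i \neq j$, and $\mathbf{A}_i$ is symmetric and idempotent (see Problem 13.17). Then by Corollary 2 to Theorem 5.5 and Corollary 1 to Theorem 5.6b, $\mathbf{y}'\mathbf{A}_i\mathbf{y}$ is $\chi^2(1, \lambda_i)$ and $\mathbf{y}'\mathbf{A}_i\mathbf{y}$ and $\mathbf{y}'\mathbf{A}_j\mathbf{y}$ are independent.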
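
The construction in the proof can be checked numerically. The sketch below is only an illustration under stated assumptions: it takes $\mathbf{A} = \mathbf{K} - \mathbf{J}/kn$ (the treatment sum-of-squares matrix of the balanced one-way model) with the arbitrary choices $k = 3$ and $n = 2$, and uses NumPy's symmetric eigensolver in place of a hand-derived spectral decomposition.

```python
import numpy as np

# Illustrative sizes (an assumption, not from the text): k treatments, n replicates.
k, n = 3, 2
N = k * n

# A = K - J/(kn) is symmetric and idempotent of rank k - 1.
K = np.kron(np.eye(k), np.ones((n, n))) / n
A = K - np.ones((N, N)) / N

# Orthonormal eigenvectors of the symmetric matrix A.
eigvals, eigvecs = np.linalg.eigh(A)
V = eigvecs[:, np.isclose(eigvals, 1.0)]      # eigenvectors for eigenvalue 1

# Rank-1 pieces A_i = v_i v_i' as in (13.57).
parts = [np.outer(V[:, i], V[:, i]) for i in range(V.shape[1])]

print(len(parts))                             # number of rank-1 components
print(np.allclose(sum(parts), A))             # A is the sum of the A_i
print(np.allclose(parts[0] @ parts[1], 0.0))  # A_i A_j = O for i != j
```

Because the eigenvalue 1 is repeated, the eigenvectors returned by the solver are not unique; any orthonormal basis of that eigenspace gives a valid partition.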


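If $\mathbf{y}'\mathbf{A}\mathbf{y}$ in Theorem 13.6b is used to represent SSH, the eigenvectors corresponding to nonzero eigenvalues of $\mathbf{A}$ always define contrasts of the cell means. In other words, the partitioning of $\mathbf{y}'\mathbf{A}\mathbf{y}$ in (13.56) is always in terms of orthogonal contrasts. To see this, note that

\[ \mathrm{SST} = \mathrm{SSH} + \mathrm{SSE}, \]

which, in the case of the one-way balanced model, implies that

\[ \mathbf{y}'\left(\mathbf{I} - \frac{1}{kn}\mathbf{J}\right)\mathbf{y} = \sum_{i=1}^{k-1}\mathbf{y}'\mathbf{v}_i\mathbf{v}_i'\mathbf{y} + \mathbf{y}'[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}']\mathbf{y}. \tag{13.58} \]

If we let

\[ \mathbf{K} = \frac{1}{n}\begin{pmatrix} \mathbf{J} & \mathbf{O} & \cdots & \mathbf{O} \\ \mathbf{O} & \mathbf{J} & \cdots & \mathbf{O} \\ \vdots & \vdots & & \vdots \\ \mathbf{O} & \mathbf{O} & \cdots & \mathbf{J} \end{pmatrix} \]

as in (13.26), then (13.58) can be rewritten as

\[ \mathbf{y}'\mathbf{y} = \mathbf{y}'\frac{1}{kn}\mathbf{J}\mathbf{y} + \sum_{i=1}^{k-1}\mathbf{y}'\mathbf{v}_i\mathbf{v}_i'\mathbf{y} + \mathbf{y}'(\mathbf{I} - \mathbf{K})\mathbf{y}. \tag{13.59} \]

By Theorem 2.13h, each $\mathbf{v}_i$ must be orthogonal to the columns of $(1/n)\mathbf{J}$ and $\mathbf{I} - \mathbf{K}$. Orthogonality to $(1/n)\mathbf{J}$ implies that $\mathbf{v}_i'\mathbf{j} = 0$; that is, $\mathbf{v}_i$ defines a contrast in the elements of $\mathbf{y}$. Orthogonality to $\mathbf{I} - \mathbf{K}$ implies that the elements of $\mathbf{v}_i$ corresponding to units associated with a particular treatment are constants. Together these results imply that $\mathbf{v}_i$ defines a contrast of the estimated treatment means.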


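Example 13.6b. Using a one-way model, we demonstrate that orthogonal contrasts in the treatment means can be expressed in terms of contrasts in the observations and that the coefficients in these contrasts form eigenvectors. For simplicity of exposition, let $k = 4$. The model is then

\[ y_{ij} = \mu + \alpha_i + \varepsilon_{ij}, \quad i = 1, 2, 3, 4, \quad j = 1, 2, \ldots, n. \]

The sums of squares in (13.59) can be written in the form

\[ \mathbf{y}'\mathbf{y} = \mathrm{SS}(\mu) + \mathrm{SS}(\alpha|\mu) + \mathrm{SSE} = \frac{y_{..}^2}{kn} + \left(\hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} - \frac{y_{..}^2}{kn}\right) + (\mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y}). \]

With $k = 4$, the sum of squares for treatments, $\mathbf{y}'\mathbf{A}\mathbf{y} = \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} - y_{..}^2/4n$, has 3 degrees of freedom. Any set of three orthogonal contrasts in the treatment means will serve to illustrate. As an example, consider $\mathbf{c}_1'\boldsymbol{\beta} = (0, 1, -1, 0, 0)\boldsymbol{\beta}$, $\mathbf{c}_2'\boldsymbol{\beta} = (0, 1, 1, -2, 0)\boldsymbol{\beta}$, and $\mathbf{c}_3'\boldsymbol{\beta} = (0, 1, 1, 1, -3)\boldsymbol{\beta}$, where $\boldsymbol{\beta} = (\mu, \alpha_1, \alpha_2, \alpha_3, \alpha_4)'$. Thus, we are comparing the first mean to the second, the first two means to the third, and the first three to the fourth (see a comment at the beginning of Section 13.4 for the equivalence of $H_0 : \alpha_1 = \alpha_2 = \alpha_3 = \alpha_4$ and $H_0 : \mu_1 = \mu_2 = \mu_3 = \mu_4$). Using the format in (13.55), we can write the three contrasts as

\[ \frac{\mathbf{c}_1'\hat{\boldsymbol{\beta}}}{\sqrt{\mathbf{c}_1'(\mathbf{X}'\mathbf{X})^{-}\mathbf{c}_1}} = \frac{\bar{y}_{1.} - \bar{y}_{2.}}{\sqrt{2/n}}, \]
\[ \frac{\mathbf{c}_2'\hat{\boldsymbol{\beta}}}{\sqrt{\mathbf{c}_2'(\mathbf{X}'\mathbf{X})^{-}\mathbf{c}_2}} = \frac{\bar{y}_{1.} + \bar{y}_{2.} - 2\bar{y}_{3.}}{\sqrt{6/n}}, \]
\[ \frac{\mathbf{c}_3'\hat{\boldsymbol{\beta}}}{\sqrt{\mathbf{c}_3'(\mathbf{X}'\mathbf{X})^{-}\mathbf{c}_3}} = \frac{\bar{y}_{1.} + \bar{y}_{2.} + \bar{y}_{3.} - 3\bar{y}_{4.}}{\sqrt{12/n}}, \]

where $(\mathbf{X}'\mathbf{X})^{-} = \mathrm{diag}[0, (1/n), \ldots, (1/n)]$ is given in (13.11) and $\hat{\boldsymbol{\beta}} = (0, \bar{y}_{1.}, \ldots, \bar{y}_{4.})'$ is from (13.12).

To write these in the form $\mathbf{v}_1'\mathbf{y}$, $\mathbf{v}_2'\mathbf{y}$, and $\mathbf{v}_3'\mathbf{y}$ [as in (13.59)] we start with the first:

\[ \frac{\bar{y}_{1.} - \bar{y}_{2.}}{\sqrt{2/n}} = \frac{1}{\sqrt{2/n}}\left(\frac{\sum_{j=1}^{n} y_{1j}}{n} - \frac{\sum_{j=1}^{n} y_{2j}}{n}\right) = \frac{1}{\sqrt{2n}}(1, \ldots, 1, -1, \ldots, -1, 0, \ldots, 0)\,\mathbf{y} = \mathbf{v}_1'\mathbf{y}, \]

where the number of 1s is $n$, the number of $-1$s is $n$, and the number of 0s is $2n$. Thus $\mathbf{v}_1' = (1/\sqrt{2n})(\mathbf{j}_n', -\mathbf{j}_n', \mathbf{0}', \mathbf{0}')$, and

\[ \mathbf{v}_1'\mathbf{v}_1 = \frac{2n}{2n} = 1. \]

Similarly, $\mathbf{v}_2'$ and $\mathbf{v}_3'$ can be expressed as $\mathbf{v}_2' = (1/\sqrt{6n})(\mathbf{j}_n', \mathbf{j}_n', -2\mathbf{j}_n', \mathbf{0}')$ and $\mathbf{v}_3' = (1/\sqrt{12n})(\mathbf{j}_n', \mathbf{j}_n', \mathbf{j}_n', -3\mathbf{j}_n')$. We now show that $\mathbf{v}_1$, $\mathbf{v}_2$, and $\mathbf{v}_3$ serve as eigenvectors in the spectral decomposition [see (2.104)] of the matrix $\mathbf{A}$ in $\mathrm{SS}(\alpha|\mu) = \mathbf{y}'\mathbf{A}\mathbf{y}$. Since $\mathbf{A}$ is idempotent of rank 3, it has three nonzero eigenvalues, each equal to 1. Thus the spectral decomposition of $\mathbf{A}$ is

\[ \mathbf{A} = \mathbf{v}_1\mathbf{v}_1' + \mathbf{v}_2\mathbf{v}_2' + \mathbf{v}_3\mathbf{v}_3' \]
\[ = \frac{1}{2n}\begin{pmatrix} \mathbf{j}_n \\ -\mathbf{j}_n \\ \mathbf{0} \\ \mathbf{0} \end{pmatrix}(\mathbf{j}_n', -\mathbf{j}_n', \mathbf{0}', \mathbf{0}') + \frac{1}{6n}\begin{pmatrix} \mathbf{j}_n \\ \mathbf{j}_n \\ -2\mathbf{j}_n \\ \mathbf{0} \end{pmatrix}(\mathbf{j}_n', \mathbf{j}_n', -2\mathbf{j}_n', \mathbf{0}') + \frac{1}{12n}\begin{pmatrix} \mathbf{j}_n \\ \mathbf{j}_n \\ \mathbf{j}_n \\ -3\mathbf{j}_n \end{pmatrix}(\mathbf{j}_n', \mathbf{j}_n', \mathbf{j}_n', -3\mathbf{j}_n') \]
\[ = \frac{1}{2n}\begin{pmatrix} \mathbf{J}_n & -\mathbf{J}_n & \mathbf{O} & \mathbf{O} \\ -\mathbf{J}_n & \mathbf{J}_n & \mathbf{O} & \mathbf{O} \\ \mathbf{O} & \mathbf{O} & \mathbf{O} & \mathbf{O} \\ \mathbf{O} & \mathbf{O} & \mathbf{O} & \mathbf{O} \end{pmatrix} + \frac{1}{6n}\begin{pmatrix} \mathbf{J}_n & \mathbf{J}_n & -2\mathbf{J}_n & \mathbf{O} \\ \mathbf{J}_n & \mathbf{J}_n & -2\mathbf{J}_n & \mathbf{O} \\ -2\mathbf{J}_n & -2\mathbf{J}_n & 4\mathbf{J}_n & \mathbf{O} \\ \mathbf{O} & \mathbf{O} & \mathbf{O} & \mathbf{O} \end{pmatrix} + \frac{1}{12n}\begin{pmatrix} \mathbf{J}_n & \mathbf{J}_n & \mathbf{J}_n & -3\mathbf{J}_n \\ \mathbf{J}_n & \mathbf{J}_n & \mathbf{J}_n & -3\mathbf{J}_n \\ \mathbf{J}_n & \mathbf{J}_n & \mathbf{J}_n & -3\mathbf{J}_n \\ -3\mathbf{J}_n & -3\mathbf{J}_n & -3\mathbf{J}_n & 9\mathbf{J}_n \end{pmatrix} \]
\[ = \frac{1}{4n}\begin{pmatrix} 3\mathbf{J}_n & -\mathbf{J}_n & -\mathbf{J}_n & -\mathbf{J}_n \\ -\mathbf{J}_n & 3\mathbf{J}_n & -\mathbf{J}_n & -\mathbf{J}_n \\ -\mathbf{J}_n & -\mathbf{J}_n & 3\mathbf{J}_n & -\mathbf{J}_n \\ -\mathbf{J}_n & -\mathbf{J}_n & -\mathbf{J}_n & 3\mathbf{J}_n \end{pmatrix}, \]

which is the matrix of the quadratic form for $\mathrm{SS}(\alpha|\mu)$ in (13.27) with $k = 4$. For $\mathrm{SS}(\mu) = y_{..}^2/4n$, we have

\[ \frac{y_{..}^2}{4n} = \mathbf{y}'\left(\frac{\mathbf{j}_{4n}\mathbf{j}_{4n}'}{4n}\right)\mathbf{y} = (\mathbf{v}_0'\mathbf{y})^2, \]

where $\mathbf{v}_0 = \mathbf{j}_{4n}/2\sqrt{n}$. It is easily shown that $\mathbf{v}_0'\mathbf{v}_0 = 1$ and that $\mathbf{v}_0'\mathbf{v}_i = 0$. It is also clear that $\mathbf{v}_0$ is an eigenvector of $\mathbf{j}_{4n}\mathbf{j}_{4n}'/4n$, because $\mathbf{j}_{4n}\mathbf{j}_{4n}'/4n$ has one eigenvalue equal to 1 and the others equal to 0, so that $\mathbf{j}_{4n}\mathbf{j}_{4n}'/4n$ is already in the form of a spectral decomposition with $\mathbf{j}_{4n}/2\sqrt{n}$ as the eigenvector corresponding to the eigenvalue 1 (see Problem 13.18b).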
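
The algebra of Example 13.6b can be verified numerically. This is a minimal sketch, assuming the arbitrary replication number $n = 3$ and taking the treatment matrix to be $\mathbf{K} - \mathbf{J}/4n$, the standard form of the quadratic-form matrix for $\mathrm{SS}(\alpha|\mu)$ in the balanced one-way model.

```python
import numpy as np

n = 3                       # illustrative replication number (an assumption)
j, z = np.ones(n), np.zeros(n)

# Normalized contrast vectors v1, v2, v3 and the mean vector v0.
v1 = np.concatenate([j, -j, z, z]) / np.sqrt(2 * n)
v2 = np.concatenate([j, j, -2 * j, z]) / np.sqrt(6 * n)
v3 = np.concatenate([j, j, j, -3 * j]) / np.sqrt(12 * n)
v0 = np.ones(4 * n) / (2 * np.sqrt(n))

# Spectral form of A and the direct form K - J/(4n).
A_spectral = np.outer(v1, v1) + np.outer(v2, v2) + np.outer(v3, v3)
K = np.kron(np.eye(4), np.ones((n, n))) / n
A_direct = K - np.ones((4 * n, 4 * n)) / (4 * n)

print(np.allclose(A_spectral, A_direct))          # the two forms agree
print([round(float(v @ v), 12) for v in (v0, v1, v2, v3)])   # unit lengths
print([round(float(v0 @ v), 12) for v in (v1, v2, v3)])      # orthogonal to v0
```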


13.6.3 Orthogonal Polynomial Contrasts 


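Suppose the treatments in a one-way analysis of variance have equally spaced quantitative levels, for example, 5, 10, 15, and 20 lb of fertilizer per plot of ground. The researcher may then wish to investigate how the response varies with the level of fertilizer. We can check for a linear trend, a quadratic trend, or a cubic trend by fitting a third-order polynomial regression model

\[ y_{ij} = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \varepsilon_{ij}, \quad i = 1, 2, 3, 4, \quad j = 1, 2, \ldots, n, \tag{13.60} \]

where $x_1 = 5$, $x_2 = 10$, $x_3 = 15$, and $x_4 = 20$. We now show that tests on the $\beta$'s in (13.60) can be carried out using orthogonal contrasts on the means $\bar{y}_{i.}$ that are estimates of $\mu_i$ in the ANOVA model

\[ y_{ij} = \mu + \alpha_i + \varepsilon_{ij} = \mu_i + \varepsilon_{ij}, \quad i = 1, 2, 3, 4, \quad j = 1, 2, \ldots, n. \tag{13.61} \]

The sum of squares for the full-reduced-model test of $H_0 : \beta_3 = 0$ is

\[ \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} - \hat{\boldsymbol{\beta}}_1^{*\prime}\mathbf{X}_1'\mathbf{y}, \tag{13.62} \]

where $\hat{\boldsymbol{\beta}}$ is from the full model in (13.60) and $\hat{\boldsymbol{\beta}}_1^{*}$ is from the reduced model with $\beta_3 = 0$ [see (8.9), (8.20), and Table 8.3]. The $\mathbf{X}$ matrix is of the form

\[ \mathbf{X} = \begin{pmatrix} 1 & x_1 & x_1^2 & x_1^3 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_1 & x_1^2 & x_1^3 \\ 1 & x_2 & x_2^2 & x_2^3 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_2 & x_2^2 & x_2^3 \\ 1 & x_3 & x_3^2 & x_3^3 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_3 & x_3^2 & x_3^3 \\ 1 & x_4 & x_4^2 & x_4^3 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_4 & x_4^2 & x_4^3 \end{pmatrix}. \tag{13.63} \]

For testing $H_0 : \beta_3 = 0$, we can use (8.37)

\[ F = \frac{\hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} - \hat{\boldsymbol{\beta}}_1^{*\prime}\mathbf{X}_1'\mathbf{y}}{s^2}, \]

or (8.39)

\[ F = \frac{\hat{\beta}_3^2}{s^2 g_{33}}, \tag{13.64} \]

where $\mathbf{X}_1$ consists of the first three columns of $\mathbf{X}$ in (13.63), $s^2 = \mathrm{SSE}/(n - 3 - 1)$, and $g_{33}$ is the last diagonal element of $(\mathbf{X}'\mathbf{X})^{-1}$. We now carry out this full-reduced-model test using contrasts.

Since the columns of $\mathbf{X}$ are not orthogonal, the sums of squares for the $\beta$'s analogous to $\hat{\beta}_3^2/g_{33}$ in (13.64) are not independent. Thus, the interpretation in terms of the degree of curvature for $E(y_{ij})$ is more difficult. We therefore orthogonalize the columns of $\mathbf{X}$ so that the sums of squares become independent.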

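To simplify computations, we first transform $x_1 = 5$, $x_2 = 10$, $x_3 = 15$, and $x_4 = 20$ by dividing by 5, the common distance between them. The $x$'s then become $x_1 = 1$, $x_2 = 2$, $x_3 = 3$, and $x_4 = 4$. The transformed $4n \times 4$ matrix $\mathbf{X}$ in (13.63) is given by

\[ \mathbf{X} = \begin{pmatrix} 1 & 1 & 1^2 & 1^3 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & 1 & 1^2 & 1^3 \\ 1 & 2 & 2^2 & 2^3 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & 2 & 2^2 & 2^3 \\ 1 & 3 & 3^2 & 3^3 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & 3 & 3^2 & 3^3 \\ 1 & 4 & 4^2 & 4^3 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & 4 & 4^2 & 4^3 \end{pmatrix} = (\mathbf{j}, \mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3), \]

where $\mathbf{j}$ is $4n \times 1$. Note that by Theorem 8.4c, the resulting $F$ statistics such as (13.64) will be unaffected by this transformation.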

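To obtain orthogonal columns, we use the orthogonalization procedure in Section 7.10 based on regressing columns of $\mathbf{X}$ on other columns and taking residuals. We begin by orthogonalizing $\mathbf{x}_1$. Denoting the first column by $\mathbf{x}_0$, we use (7.97) to obtain

\[ \mathbf{x}_{1\cdot 0} = \mathbf{x}_1 - \mathbf{x}_0(\mathbf{x}_0'\mathbf{x}_0)^{-1}\mathbf{x}_0'\mathbf{x}_1 = \mathbf{x}_1 - \mathbf{j}(\mathbf{j}'\mathbf{j})^{-1}\mathbf{j}'\mathbf{x}_1 = \mathbf{x}_1 - \mathbf{j}(4n)^{-1}n\sum_{i=1}^{4} x_i = \mathbf{x}_1 - \bar{x}\mathbf{j}. \tag{13.65} \]

The residual vector $\mathbf{x}_{1\cdot 0}$ is orthogonal to $\mathbf{x}_0 = \mathbf{j}$:

\[ \mathbf{j}'\mathbf{x}_{1\cdot 0} = \mathbf{j}'(\mathbf{x}_1 - \bar{x}\mathbf{j}) = \mathbf{j}'\mathbf{x}_1 - \bar{x}\mathbf{j}'\mathbf{j} = 4n\bar{x} - 4n\bar{x} = 0. \tag{13.66} \]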


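We apply this procedure successively to the other two columns of $\mathbf{X}$. To transform the third column, $\mathbf{x}_2$, so that it is orthogonal to the first two columns, we use (7.97) to obtain

\[ \mathbf{x}_{2\cdot 01} = \mathbf{x}_2 - \mathbf{Z}_1(\mathbf{Z}_1'\mathbf{Z}_1)^{-1}\mathbf{Z}_1'\mathbf{x}_2, \tag{13.67} \]

where $\mathbf{Z}_1 = (\mathbf{j}, \mathbf{x}_{1\cdot 0})$. We use the notation $\mathbf{Z}_1$ instead of $\mathbf{X}_1$ because $\mathbf{x}_{1\cdot 0}$, the second column of $\mathbf{Z}_1$, is different from $\mathbf{x}_1$, the second column of $\mathbf{X}_1$. The matrix $\mathbf{Z}_1'\mathbf{Z}_1$ is given by

\[ \mathbf{Z}_1'\mathbf{Z}_1 = \begin{pmatrix} \mathbf{j}' \\ \mathbf{x}_{1\cdot 0}' \end{pmatrix}(\mathbf{j}, \mathbf{x}_{1\cdot 0}) = \begin{pmatrix} \mathbf{j}'\mathbf{j} & 0 \\ 0 & \mathbf{x}_{1\cdot 0}'\mathbf{x}_{1\cdot 0} \end{pmatrix} \quad [\text{by (13.66)}], \]

and (13.67) becomes

\[ \mathbf{x}_{2\cdot 01} = \mathbf{x}_2 - \mathbf{Z}_1(\mathbf{Z}_1'\mathbf{Z}_1)^{-1}\mathbf{Z}_1'\mathbf{x}_2 = \mathbf{x}_2 - (\mathbf{j}, \mathbf{x}_{1\cdot 0})\begin{pmatrix} \mathbf{j}'\mathbf{j} & 0 \\ 0 & \mathbf{x}_{1\cdot 0}'\mathbf{x}_{1\cdot 0} \end{pmatrix}^{-1}\begin{pmatrix} \mathbf{j}'\mathbf{x}_2 \\ \mathbf{x}_{1\cdot 0}'\mathbf{x}_2 \end{pmatrix} = \mathbf{x}_2 - \frac{\mathbf{j}'\mathbf{x}_2}{\mathbf{j}'\mathbf{j}}\mathbf{j} - \frac{\mathbf{x}_{1\cdot 0}'\mathbf{x}_2}{\mathbf{x}_{1\cdot 0}'\mathbf{x}_{1\cdot 0}}\mathbf{x}_{1\cdot 0}. \tag{13.68} \]

The residual vector $\mathbf{x}_{2\cdot 01}$ is orthogonal to $\mathbf{x}_0 = \mathbf{j}$ and to $\mathbf{x}_{1\cdot 0}$:

\[ \mathbf{j}'\mathbf{x}_{2\cdot 01} = 0, \qquad \mathbf{x}_{1\cdot 0}'\mathbf{x}_{2\cdot 01} = 0. \tag{13.69} \]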


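The fourth column of $\mathbf{Z}$ becomes

\[ \mathbf{x}_{3\cdot 012} = \mathbf{x}_3 - \frac{\mathbf{j}'\mathbf{x}_3}{\mathbf{j}'\mathbf{j}}\mathbf{j} - \frac{\mathbf{x}_{1\cdot 0}'\mathbf{x}_3}{\mathbf{x}_{1\cdot 0}'\mathbf{x}_{1\cdot 0}}\mathbf{x}_{1\cdot 0} - \frac{\mathbf{x}_{2\cdot 01}'\mathbf{x}_3}{\mathbf{x}_{2\cdot 01}'\mathbf{x}_{2\cdot 01}}\mathbf{x}_{2\cdot 01}, \tag{13.70} \]

which is orthogonal to the first three columns, $\mathbf{j}$, $\mathbf{x}_{1\cdot 0}$, and $\mathbf{x}_{2\cdot 01}$.

We have thus transformed $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ to

\[ \mathbf{y} = \mathbf{Z}\boldsymbol{\theta} + \boldsymbol{\varepsilon}, \tag{13.71} \]

where the columns of $\mathbf{Z}$ are mutually orthogonal and the elements of $\boldsymbol{\theta}$ are functions of the $\beta$'s. The columns of $\mathbf{Z}$ are given in (13.65), (13.68), and (13.70):

\[ \mathbf{z}_0 = \mathbf{j}, \quad \mathbf{z}_1 = \mathbf{x}_{1\cdot 0}, \quad \mathbf{z}_2 = \mathbf{x}_{2\cdot 01}, \quad \mathbf{z}_3 = \mathbf{x}_{3\cdot 012}. \]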


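We now evaluate $\mathbf{z}_1$, $\mathbf{z}_2$, and $\mathbf{z}_3$ for our illustration, in which $x_1 = 1$, $x_2 = 2$, $x_3 = 3$, and $x_4 = 4$. By (13.65), we obtain

\[ \mathbf{z}_1 = \mathbf{x}_{1\cdot 0} = \mathbf{x}_1 - \bar{x}\mathbf{j} = \mathbf{x}_1 - 2.5\mathbf{j} = (-1.5, \ldots, -1.5, -.5, \ldots, -.5, .5, \ldots, .5, 1.5, \ldots, 1.5)', \]

which we multiply by 2 so as to obtain integer values:

\[ \mathbf{z}_1 = (-3, \ldots, -3, -1, \ldots, -1, 1, \ldots, 1, 3, \ldots, 3)'. \tag{13.72} \]

Note that multiplying by 2 preserves the orthogonality and does not affect the $F$ values.

To obtain $\mathbf{z}_2$, by (13.68), we first compute

\[ \frac{\mathbf{j}'\mathbf{x}_2}{\mathbf{j}'\mathbf{j}} = \frac{n(1^2 + 2^2 + 3^2 + 4^2)}{4n} = \frac{30}{4} = 7.5, \]
\[ \frac{\mathbf{x}_{1\cdot 0}'\mathbf{x}_2}{\mathbf{x}_{1\cdot 0}'\mathbf{x}_{1\cdot 0}} = \frac{n[-3(1^2) - 1(2^2) + 1(3^2) + 3(4^2)]}{n[(-3)^2 + (-1)^2 + 1^2 + 3^2]} = \frac{50}{20} = 2.5. \]

Then, by (13.68), we obtain

\[ \mathbf{z}_2 = \mathbf{x}_2 - \frac{\mathbf{j}'\mathbf{x}_2}{\mathbf{j}'\mathbf{j}}\mathbf{j} - \frac{\mathbf{x}_{1\cdot 0}'\mathbf{x}_2}{\mathbf{x}_{1\cdot 0}'\mathbf{x}_{1\cdot 0}}\mathbf{x}_{1\cdot 0} = \mathbf{x}_2 - 7.5\mathbf{j} - 2.5\mathbf{x}_{1\cdot 0} \]
\[ = (1^2, \ldots, 1^2, 2^2, \ldots, 2^2, 3^2, \ldots, 3^2, 4^2, \ldots, 4^2)' - 7.5(1, \ldots, 1)' - 2.5(-3, \ldots, -3, -1, \ldots, -1, 1, \ldots, 1, 3, \ldots, 3)' \]
\[ = (1, \ldots, 1, -1, \ldots, -1, -1, \ldots, -1, 1, \ldots, 1)'. \tag{13.73} \]

Similarly, using (13.70), we obtain

\[ \mathbf{z}_3 = (-1, \ldots, -1, 3, \ldots, 3, -3, \ldots, -3, 1, \ldots, 1)'. \tag{13.74} \]

Thus $\mathbf{Z}$ is given by

\[ \mathbf{Z} = \begin{pmatrix} 1 & -3 & 1 & -1 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & -3 & 1 & -1 \\ 1 & -1 & -1 & 3 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & -1 & -1 & 3 \\ 1 & 1 & -1 & -3 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & 1 & -1 & -3 \\ 1 & 3 & 1 & 1 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & 3 & 1 & 1 \end{pmatrix}. \]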
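
The successive-residual orthogonalization described above can be sketched in a few lines. This is an illustrative implementation, not the book's code: the function name `residualize` is hypothetical, $n = 2$ is an arbitrary choice, and the rescaling factors (2, 1, and 10/3) are chosen only to recover the integer coefficients in (13.72) to (13.74).

```python
import numpy as np

n = 2                                       # illustrative replication number
x = np.repeat([1.0, 2.0, 3.0, 4.0], n)      # transformed levels 1, 2, 3, 4
X = np.column_stack([np.ones(4 * n), x, x**2, x**3])

def residualize(X):
    """Replace each column by its residual after regression on earlier columns."""
    Z = np.empty_like(X)
    for col in range(X.shape[1]):
        z = X[:, col].copy()
        for prev in range(col):
            zp = Z[:, prev]                 # earlier columns are mutually orthogonal
            z -= (zp @ X[:, col]) / (zp @ zp) * zp
        Z[:, col] = z
    return Z

Z = residualize(X)
# Rescale columns to integers; scaling does not affect orthogonality.
Z_int = Z * np.array([1.0, 2.0, 1.0, 10.0 / 3.0])

print(np.round(Z_int[::n], 6))              # one row per treatment level
```

Each distinct row of `Z_int` should reproduce a row of the matrix $\mathbf{Z}$ displayed above, and `Z_int.T @ Z_int` should be diagonal.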
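Since

\[ \mathbf{X}\boldsymbol{\beta} = \mathbf{Z}\boldsymbol{\theta}, \]

we can find the $\theta$'s in terms of the $\beta$'s or the $\beta$'s in terms of the $\theta$'s. For our illustration, these relationships are given by (see Problem 13.24)

\[ \beta_0 = \theta_0 - 5\theta_1 + 5\theta_2 - 35\theta_3, \qquad \beta_1 = 2\theta_1 - 5\theta_2 + \frac{16.7}{.3}\theta_3, \]
\[ \beta_2 = \theta_2 - 25\theta_3, \qquad \beta_3 = \frac{\theta_3}{.3}. \tag{13.75} \]

Since the columns of $\mathbf{Z} = (\mathbf{j}, \mathbf{z}_1, \mathbf{z}_2, \mathbf{z}_3)$ are orthogonal ($\mathbf{z}_i'\mathbf{z}_j = 0$ for all $i \neq j$), we have $\mathbf{Z}'\mathbf{Z} = \mathrm{diag}(\mathbf{j}'\mathbf{j}, \mathbf{z}_1'\mathbf{z}_1, \mathbf{z}_2'\mathbf{z}_2, \mathbf{z}_3'\mathbf{z}_3)$. Thus

\[ \hat{\boldsymbol{\theta}} = (\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{y} = \begin{pmatrix} \mathbf{j}'\mathbf{y}/\mathbf{j}'\mathbf{j} \\ \mathbf{z}_1'\mathbf{y}/\mathbf{z}_1'\mathbf{z}_1 \\ \mathbf{z}_2'\mathbf{y}/\mathbf{z}_2'\mathbf{z}_2 \\ \mathbf{z}_3'\mathbf{y}/\mathbf{z}_3'\mathbf{z}_3 \end{pmatrix}. \tag{13.76} \]

The regression sum of squares (uncorrected for $\theta_0$) is

\[ \mathrm{SS}(\boldsymbol{\theta}) = \hat{\boldsymbol{\theta}}'\mathbf{Z}'\mathbf{y} = \sum_{i=0}^{3}\frac{(\mathbf{z}_i'\mathbf{y})^2}{\mathbf{z}_i'\mathbf{z}_i}, \tag{13.77} \]

where $\mathbf{z}_0 = \mathbf{j}$. By an argument similar to that following (13.54), the sums of squares on the right side of (13.77) are independent.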
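
As a numerical check of (13.75) to (13.77), the following sketch fits both parameterizations to simulated data. The data, the seed, and $n = 5$ are assumptions made purely for illustration; the point is that the two fits give the same regression sum of squares and that the $\hat{\beta}$'s can be recovered from the $\hat{\theta}$'s.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5                                          # illustrative replication number
x = np.repeat([1.0, 2.0, 3.0, 4.0], n)
X = np.column_stack([np.ones_like(x), x, x**2, x**3])

# Orthogonal polynomial columns (linear, quadratic, cubic) for k = 4.
coefs = [[-3, -1, 1, 3], [1, -1, -1, 1], [-1, 3, -3, 1]]
Z = np.column_stack([np.ones_like(x)] +
                    [np.repeat(np.array(c, dtype=float), n) for c in coefs])

y = 10 + 0.5 * x + rng.normal(size=x.size)     # simulated responses (hypothetical)

# theta-hat and the individual sums of squares, as in (13.76) and (13.77).
zz = np.sum(Z**2, axis=0)
theta_hat = (Z.T @ y) / zz
ss_theta = (Z.T @ y) ** 2 / zz

# Regression on the original polynomial columns.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
ss_reg_X = beta_hat @ X.T @ y

# beta-hat recovered from theta-hat via (13.75).
t0, t1, t2, t3 = theta_hat
beta_from_theta = np.array([t0 - 5 * t1 + 5 * t2 - 35 * t3,
                            2 * t1 - 5 * t2 + (16.7 / 0.3) * t3,
                            t2 - 25 * t3,
                            t3 / 0.3])

print(np.isclose(ss_theta.sum(), ss_reg_X))
print(np.allclose(beta_from_theta, beta_hat, atol=1e-6))
```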

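Since the sums of squares $\mathrm{SS}(\theta_i) = (\mathbf{z}_i'\mathbf{y})^2/\mathbf{z}_i'\mathbf{z}_i$, $i = 1, 2, 3$, are independent, each $\mathrm{SS}(\theta_i)$ tests the significance of $\theta_i$ by itself (regressing $\mathbf{y}$ on $\mathbf{z}_i$ alone) as well as in the presence of the other $\theta$'s; that is, for a general $k$, we have

\[ \mathrm{SS}(\theta_i|\theta_0, \ldots, \theta_{i-1}, \theta_{i+1}, \ldots, \theta_k) = \mathrm{SS}(\theta_0, \ldots, \theta_k) - \mathrm{SS}(\theta_0, \ldots, \theta_{i-1}, \theta_{i+1}, \ldots, \theta_k) \]
\[ = \sum_{j=0}^{k}\frac{(\mathbf{z}_j'\mathbf{y})^2}{\mathbf{z}_j'\mathbf{z}_j} - \sum_{j \neq i}\frac{(\mathbf{z}_j'\mathbf{y})^2}{\mathbf{z}_j'\mathbf{z}_j} = \frac{(\mathbf{z}_i'\mathbf{y})^2}{\mathbf{z}_i'\mathbf{z}_i} = \mathrm{SS}(\theta_i). \]

In terms of the $\beta$'s, it can be shown that each $\mathrm{SS}(\theta_i)$ tests the significance of $\beta_i$ in the presence of $\beta_0, \beta_1, \ldots, \beta_{i-1}$. For example, for $\beta_k$ (the last $\beta$), the sum of squares can be written as

\[ \mathrm{SS}(\theta_k) = \frac{(\mathbf{z}_k'\mathbf{y})^2}{\mathbf{z}_k'\mathbf{z}_k} = \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} - \hat{\boldsymbol{\beta}}_1^{*\prime}\mathbf{X}_1'\mathbf{y} \tag{13.78} \]

(see Problem 13.26), where $\hat{\boldsymbol{\beta}}$ is from the full model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ and $\hat{\boldsymbol{\beta}}_1^{*}$ is from the reduced model $\mathbf{y} = \mathbf{X}_1\boldsymbol{\beta}_1^{*} + \boldsymbol{\varepsilon}$, in which $\boldsymbol{\beta}_1^{*}$ contains all the $\beta$'s except $\beta_k$ and $\mathbf{X}_1$ consists of all columns of $\mathbf{X}$ except the last.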

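The sum of squares $\mathrm{SS}(\theta_i) = (\mathbf{z}_i'\mathbf{y})^2/\mathbf{z}_i'\mathbf{z}_i$ is equivalent to a sum of squares for a contrast on the means $\bar{y}_{1.}, \bar{y}_{2.}, \ldots, \bar{y}_{k.}$ as in (13.52). For example

\[ \mathbf{z}_1'\mathbf{y} = -3y_{11} - 3y_{12} - \cdots - 3y_{1n} - y_{21} - \cdots - y_{2n} + y_{31} + \cdots + y_{3n} + 3y_{41} + \cdots + 3y_{4n} \]
\[ = -3\sum_{j=1}^{n} y_{1j} - \sum_{j=1}^{n} y_{2j} + \sum_{j=1}^{n} y_{3j} + 3\sum_{j=1}^{n} y_{4j} \]
\[ = -3y_{1.} - y_{2.} + y_{3.} + 3y_{4.} = n(-3\bar{y}_{1.} - \bar{y}_{2.} + \bar{y}_{3.} + 3\bar{y}_{4.}) = n\sum_{i=1}^{4} c_i\bar{y}_{i.}, \]

where $c_1 = -3$, $c_2 = -1$, $c_3 = 1$, and $c_4 = 3$. Similarly

\[ \mathbf{z}_1'\mathbf{z}_1 = n(-3)^2 + n(-1)^2 + n(1)^2 + n(3)^2 = n[(-3)^2 + (-1)^2 + 1^2 + 3^2] = n\sum_{i=1}^{4} c_i^2. \]

Then

\[ \frac{(\mathbf{z}_1'\mathbf{y})^2}{\mathbf{z}_1'\mathbf{z}_1} = \frac{\left(n\sum_{i=1}^{4} c_i\bar{y}_{i.}\right)^2}{n\sum_{i=1}^{4} c_i^2} = \frac{n\left(\sum_{i=1}^{4} c_i\bar{y}_{i.}\right)^2}{\sum_{i=1}^{4} c_i^2}, \]

which is the sum of squares for the contrast in (13.52). Note that the coefficients $-3$, $-1$, 1, and 3 correspond to a linear trend.

Likewise, $\mathbf{z}_2'\mathbf{y}$ becomes

\[ \mathbf{z}_2'\mathbf{y} = n(\bar{y}_{1.} - \bar{y}_{2.} - \bar{y}_{3.} + \bar{y}_{4.}), \]

whose coefficients show a quadratic trend, and $\mathbf{z}_3'\mathbf{y}$ can be written as

\[ \mathbf{z}_3'\mathbf{y} = n(-\bar{y}_{1.} + 3\bar{y}_{2.} - 3\bar{y}_{3.} + \bar{y}_{4.}) \]

with coefficients that exhibit a cubic pattern.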

These contrasts in the $\bar{y}_{i.}$'s have a meaningful interpretation in terms of the shape of the response curve. For example, suppose that the $\bar{y}_{i.}$'s fall on a straight line. Then, for some $b_0$ and $b_1$, we have

\[ \bar{y}_{i.} = b_0 + b_1 x_i = b_0 + b_1 i, \quad i = 1, 2, 3, 4, \]

since $x_i = i$. In this case, the linear contrast is nonzero and the quadratic and cubic contrasts are zero:

\[ -3(b_0 + b_1) - (b_0 + 2b_1) + (b_0 + 3b_1) + 3(b_0 + 4b_1) = 10b_1, \]
\[ (b_0 + b_1) - (b_0 + 2b_1) - (b_0 + 3b_1) + (b_0 + 4b_1) = 0, \]
\[ -(b_0 + b_1) + 3(b_0 + 2b_1) - 3(b_0 + 3b_1) + (b_0 + 4b_1) = 0. \]

This demonstration could be simplified by choosing the linear trend $\bar{y}_{1.} = 1$, $\bar{y}_{2.} = 2$, $\bar{y}_{3.} = 3$, and $\bar{y}_{4.} = 4$.

Similarly, if the $\bar{y}_{i.}$'s follow a quadratic trend, say

\[ \bar{y}_{1.} = 1, \quad \bar{y}_{2.} = 2, \quad \bar{y}_{3.} = 2, \quad \bar{y}_{4.} = 1, \]

then the linear and cubic contrasts are zero.
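
The two illustrations above reduce to a small matrix product. The sketch below applies the linear, quadratic, and cubic coefficient vectors for $k = 4$ to the two sets of means mentioned in the text; the variable names are chosen here for illustration.

```python
import numpy as np

# Rows: linear, quadratic, and cubic contrast coefficients for k = 4.
C = np.array([[-3, -1,  1, 3],
              [ 1, -1, -1, 1],
              [-1,  3, -3, 1]])

straight_line = np.array([1, 2, 3, 4])   # means on a line (b0 = 0, b1 = 1)
quadratic = np.array([1, 2, 2, 1])       # means with a quadratic trend

line_contrasts = C @ straight_line       # only the linear contrast is nonzero
quad_contrasts = C @ quadratic           # only the quadratic contrast is nonzero

print(line_contrasts)   # [10  0  0]
print(quad_contrasts)   # [ 0 -2  0]
```

With $b_1 = 1$ the linear contrast equals $10b_1 = 10$, in agreement with the first displayed equation.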

In many cases it is not necessary to find the orthogonal polynomial coefficients by the orthogonalization process illustrated in this section. Tables of orthogonal polynomials are available [see, e.g., Rencher (2002, p. 587) or Guttman (1982, pp. 349-354)]. We give a brief illustration of some orthogonal polynomial coefficients in Table 13.5, including those we found above for $k = 4$.

TABLE 13.5 Orthogonal Polynomial Coefficients for k = 3, 4, 5

              k = 3          k = 4               k = 5
Linear      -1  0  1     -3 -1  1  3     -2 -1  0  1  2
Quadratic    1 -2  1      1 -1 -1  1      2 -1 -2 -1  2
Cubic                    -1  3 -3  1     -1  2  0 -2  1
Quartic                                   1 -4  6 -4  1

In Table 13.5, we can see some relationships among the coefficients for each value of $k$. For example, if $k = 3$ and the three means $\bar{y}_{1.}, \bar{y}_{2.}, \bar{y}_{3.}$ have a linear relationship, then $\bar{y}_{2.} - \bar{y}_{1.}$ is equal to $\bar{y}_{3.} - \bar{y}_{2.}$; that is

\[ \bar{y}_{3.} - \bar{y}_{2.} = \bar{y}_{2.} - \bar{y}_{1.} \]

or

\[ \bar{y}_{3.} - \bar{y}_{2.} - (\bar{y}_{2.} - \bar{y}_{1.}) = 0, \]
\[ \bar{y}_{3.} - 2\bar{y}_{2.} + \bar{y}_{1.} = 0. \]

If this relationship among the three means fails to hold, we have a quadratic component of curvature.

Similarly, for $k = 4$, the cubic component, $-\bar{y}_{1.} + 3\bar{y}_{2.} - 3\bar{y}_{3.} + \bar{y}_{4.}$, is equal to the difference between the quadratic component for $\bar{y}_{1.}, \bar{y}_{2.}, \bar{y}_{3.}$ and the quadratic component for $\bar{y}_{2.}, \bar{y}_{3.}, \bar{y}_{4.}$:

\[ -\bar{y}_{1.} + 3\bar{y}_{2.} - 3\bar{y}_{3.} + \bar{y}_{4.} = \bar{y}_{2.} - 2\bar{y}_{3.} + \bar{y}_{4.} - (\bar{y}_{1.} - 2\bar{y}_{2.} + \bar{y}_{3.}). \]


PROBLEMS 


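13.1 Obtain the normal equations in (13.7) from the model in (13.6).

13.2 Obtain $\hat{\boldsymbol{\beta}}$ in (13.12) using $(\mathbf{X}'\mathbf{X})^{-}$ in (13.11) and $\mathbf{X}'\mathbf{y}$ in (13.7).

13.3 Show that $\mathrm{SSE} = \mathbf{y}'[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}']\mathbf{y}$ in (12.21) is equal to $\mathrm{SSE} = \sum_{ij} y_{ij}^2 - \sum_i y_{i.}^2/n$ in (13.13).

13.4 Show that the expressions for SSE in (13.13) and (13.14) are equal.

13.5 (a) Show that $H_0 : \alpha_1 = \alpha_2 = \cdots = \alpha_k$ in (13.17) is equivalent to $H_0 : \alpha_1^* = \alpha_2^* = \cdots = \alpha_k^*$ in (13.18).
(b) Show that $H_0 : \alpha_1^* = \alpha_2^* = \cdots = \alpha_k^*$ in (13.18) is equivalent to $H_0 : \alpha_1^* = \alpha_2^* = \cdots = \alpha_k^* = 0$ in (13.19).

13.6 Show that $n\sum_{i=1}^{k}(\bar{y}_{i.} - \bar{y}_{..})^2$ in (13.24) is equal to $\sum_i y_{i.}^2/n - y_{..}^2/kn$ in (13.23).

13.7 Using (13.6) and (13.11), show that $\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'$ in (13.25) can be written in terms of $\mathbf{J}$ and $\mathbf{O}$ as in (13.26).

13.8 Show that for $\mathbf{C}$ in (13.28), $\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{C}'$ is given by (13.30).

13.9 Show that $\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'$ is given by the matrix in (13.32).

13.10 Show that the matrix $(1/4n)\mathbf{A}'\mathbf{J}_3\mathbf{A}$ in (13.33) has the form shown in (13.35).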


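13.11 Using the model $y_{ij} = \mu^* + \alpha_i^* + \varepsilon_{ij}$ with the assumptions $E(\varepsilon_{ij}) = 0$, $\mathrm{var}(\varepsilon_{ij}) = \sigma^2$, $\mathrm{cov}(\varepsilon_{ij}, \varepsilon_{i'j'}) = 0$, and the side condition $\sum_{i=1}^{k}\alpha_i^* = 0$, obtain the following results used in Table 13.4:
(a) $E(\varepsilon_{ij}^2) = \sigma^2$ for all $i, j$ and $E(\varepsilon_{ij}\varepsilon_{i'j'}) = 0$ for $(i, j) \neq (i', j')$.
(b) $E[\mathrm{SS}(\alpha|\mu)] = (k - 1)\sigma^2 + n\sum_{i=1}^{k}\alpha_i^{*2}$.
(c) $E(\mathrm{SSE}) = k(n - 1)\sigma^2$.

13.12 Using $\mathbf{C}$ in (13.28), show that $\mathbf{C}'\mathbf{C}$ is given by the matrix in (13.47).

13.13 Show that $\mathbf{C}'\mathbf{J}_3\mathbf{C}$ has the form shown in (13.48).

13.14 Show that if the constraint $\sum_{i=1}^{k}\alpha_i^* = 0$ is not imposed, (13.49) becomes

\[ E[\mathrm{SS}(\alpha|\mu)] = 3\sigma^2 + 4\sum_{i=1}^{4}(\alpha_i^* - \bar{\alpha}_{.}^*)^2. \]

13.15 Show that $F$ in (13.52) can be obtained from (13.51).

13.16 Express the sums of squares $(\mathbf{c}_i'\hat{\boldsymbol{\beta}})^2/\mathbf{c}_i'(\mathbf{X}'\mathbf{X})^{-}\mathbf{c}_i$ and $(\mathbf{c}_j'\hat{\boldsymbol{\beta}})^2/\mathbf{c}_j'(\mathbf{X}'\mathbf{X})^{-}\mathbf{c}_j$ below (13.54) in Section 13.6.2 as quadratic forms in $\mathbf{y}$, and show that these sums of squares are independent if $\mathrm{cov}(\mathbf{c}_i'\hat{\boldsymbol{\beta}}, \mathbf{c}_j'\hat{\boldsymbol{\beta}}) = 0$ as in (13.53).

13.17 In the proof of Theorem 13.6b, show that $\mathbf{A}_i$ is symmetric and idempotent, that $\mathrm{rank}(\mathbf{A}_i) = 1$, and that $\mathbf{A}_i\mathbf{A}_j = \mathbf{O}$.

13.18 (a) Show that $\mathbf{J}/kn$ in the first term on the right side of (13.59) is idempotent with one eigenvalue equal to 1 and the others equal to 0.
(b) Show that $\mathbf{j}$ is an eigenvector corresponding to the nonzero eigenvalue of $\mathbf{J}/kn$.

13.19 In Example 13.6b, show that $\mathbf{v}_0'\mathbf{v}_0 = 1$ and $\mathbf{v}_0'\mathbf{v}_i = 0$.

13.20 Show that $\mathbf{j}'\mathbf{x}_{2\cdot 01} = 0$ and $\mathbf{x}_{1\cdot 0}'\mathbf{x}_{2\cdot 01} = 0$ as in (13.69).

13.21 Show that $\mathbf{x}_{3\cdot 012}$ has the form given in (13.70).

13.22 Show that $\mathbf{x}_{3\cdot 012}$ is orthogonal to each of $\mathbf{j}$, $\mathbf{x}_{1\cdot 0}$, and $\mathbf{x}_{2\cdot 01}$, as noted following (13.70).

13.23 Show that $\mathbf{z}_3 = (-1, \ldots, -1, 3, \ldots, 3, -3, \ldots, -3, 1, \ldots, 1)'$ as in (13.74).

13.24 Show that $\beta_0 = \theta_0 - 5\theta_1 + 5\theta_2 - 35\theta_3$, $\beta_1 = 2\theta_1 - 5\theta_2 + (16.7/.3)\theta_3$, $\beta_2 = \theta_2 - 25\theta_3$, and $\beta_3 = \theta_3/.3$, as in (13.75).

13.25 Show that the elements of $\hat{\boldsymbol{\theta}} = (\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{y}$ are of the form $\mathbf{z}_i'\mathbf{y}/\mathbf{z}_i'\mathbf{z}_i$ as in (13.76).

13.26 Show that $\mathrm{SS}(\theta_k) = \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} - \hat{\boldsymbol{\beta}}_1^{*\prime}\mathbf{X}_1'\mathbf{y}$ as in (13.78).

13.27 If the means $\bar{y}_{1.}$, $\bar{y}_{2.}$, $\bar{y}_{3.}$, and $\bar{y}_{4.}$ have the quadratic trend $\bar{y}_{1.} = 1$, $\bar{y}_{2.} = 2$, $\bar{y}_{3.} = 2$, $\bar{y}_{4.} = 1$, show that the linear and cubic contrasts are zero, but the quadratic contrast is not zero.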

13.28 Blood sugar levels (mg/100 g) were measured on 10 animals from each of five breeds (Daniel 1974, p. 197). The results are presented in Table 13.6.
(a) Test the hypothesis of equality of means for the five breeds.
(b) Make the following comparisons by means of orthogonal contrasts: A, B, C vs. D, E; A, B vs. C; A vs. B; D vs. E.

TABLE 13.6 Blood Sugar Levels (mg/100 g) for 10 Animals from Each of Five Breeds (A-E)

  A     B     C     D     E
 124   111   117   104   142
 116   101   142   128   139
 101   130   121   130   133
 118   108   123   103   120
 118   127   121   121   127
 120   129   148   119   149
 110   122   141   106   150
 127   103   122   107   149
 106   122   139   107   120
 130   127   125   115   116

13.29 In Table 13.7, we have the amount of insulin released from specimens of pancreatic tissue treated with five concentrations of glucose (Daniel 1974, p. 182).
(a) Test the hypothesis of equality of means for the five glucose concentrations.
(b) Assuming that the levels of glucose concentration are equally spaced, use orthogonal polynomial contrasts to test for linear, quadratic, cubic, and quartic trends.

13.30 A different stimulus was given to each of three groups of 14 animals (Daniel 1974, p. 196). The response times in seconds are given in Table 13.8.
(a) Test the hypothesis of equal mean response times.
(b) Using orthogonal contrasts, make the two comparisons of stimuli: 1 versus 2, 3; and 2 versus 3.

13.31 The tensile strength (kg) was measured for 12 wires from each of nine cables (Hald 1952, p. 434). The results are given in Table 13.9.
(a) Test the hypothesis of equal mean strengths for the nine cables.
(b) The first four cables were made from one type of raw material and the other five from another type. Compare these two types by means of a contrast.

TABLE 13.7 Insulin Released at Five Different Glucose Concentrations (1-5)


1 2 3 4 5 
1.53 3.15 3.89 8.18 5.86 
1.61 3.96 4.80 5.64 5.46 
3.75 3.59 3.69 7.36 5.96 
2.89 1.89 5.70 5.33 6.49 
3.26 1.45 5.62 8.82 7.81 
2.83 3.49 5.79 5.26 9.03 
2.86 1.56 4.75 8.75 7TA9 
2.59 2.44 5.33 7.10 8.98 
TABLE 13.8 Response Times (in seconds) to Three 
Stimuli 

Stimulus Stimulus 
1 2 3 1 2 3 
16 6 8 17 6 9 
14 ip 10 7 8 11 
14 Fi 9 17 6 11 
13 8 10 19 4 9 
13 4 6 14 9 10 
12 8 7 15 5 9 
12 9 10 20 5 2) 


TABLE 13.9 Tensile Strength (kg) of Wires from Nine Cables (1-9) 


1 2 3 4 5 6 7 8 9 
345 329 340 328 347 341 339 339 342 
327 327 330 344 341 340 340 340 346 
335 332 325 342 345 335 342 347 347 
338 348 328 350 340 336 341 345 348 
330 337 338 335 350 339 336 350 355 
334 328 332 332 346 340 342 348 351 
335 328 335 328 345 342 347 341 333 
340 330 340 340 342 345 345 342 347 
333 328 335 337 330 346 336 340 348 
335 330 329 340 338 347 342 345 341 


13.32 Four groups of physical therapy patients were given different treatments (Daniel 1974, p. 195). The scores measuring treatment effectiveness are given in Table 13.10.
(a) Test the hypothesis of equal mean treatment effects.
(b) Using contrasts, compare treatments 1, 2 versus 3, 4; 1 versus 2; and 3 versus 4.

TABLE 13.10 Scores for Physical Therapy Patients Subjected to Four Treatment Programs (1-4)

  1    2    3    4
 64   76   58   95
 88   70   74   90
 72   90   66   80
 80   80   60   87
 79   75   82   88
 71   82   75   85

13.33 Weight gains in pigs subjected to five different treatments are given in Table 13.11 (Crampton and Hopkins 1934).
(a) Test the hypothesis of equal mean treatment effects.
(b) Using contrasts, compare treatments 1, 2, 3 versus 4; 1, 2 versus 3; and 1 versus 2.

TABLE 13.11 Weight Gain of Pigs Subjected to Five Treatments (1-5)

  1     2     3     4     5
 165   168   164   185   201
 156   180   156   195   189
 159   180   156   195   189
 159   180   189   184   173
 167   166   138   201   193
 170   170   153   165   164
 146   161   190   175   160
 130   171   160   187   200
 151   169   172   177   142
 164   179   142   166   184
 158   191   155   165   149


14 Two-Way Analysis-of-Variance: 
Balanced Case 


The two-way model without interaction has been illustrated in Section 12.1.2, 
Example 12.2.2b, and Section 12.8. In this chapter, we consider the two-way 
ANOVA model with interaction. In Section 14.1 we discuss the model and attendant 
assumptions. In Section 14.2 we consider estimable functions involving main effects 
and interactions. In Section 14.3 we discuss estimation of the parameters, including 
solutions to the normal equations using side conditions and also using a generalized 
inverse. In Section 14.4 we develop a hypothesis test for the interaction using a full— 
reduced model, and we obtain tests for main effects using the general linear hypoth- 
esis as well as the full—reduced-model approach. In Section 14.5 we derive expected 
mean squares from the basic definition and also using a general linear hypothesis 
approach. Throughout this chapter we consider only the balanced two-way model. 
The unbalanced case is covered in Chapter 15. 


14.1 THE TWO-WAY MODEL 
The two-way balanced model can be specified as follows:

\[ y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \varepsilon_{ijk}, \tag{14.1} \]
\[ i = 1, 2, \ldots, a, \quad j = 1, 2, \ldots, b, \quad k = 1, 2, \ldots, n. \]

The effect of factor A at the $i$th level is $\alpha_i$, and the term $\beta_j$ is due to the $j$th level of factor B. The term $\gamma_{ij}$ represents the interaction AB between the $i$th level of A and the $j$th level of B. If an interaction is present, the difference $\alpha_1 - \alpha_2$, for example, is not estimable and the hypothesis $H_0 : \alpha_1 = \alpha_2 = \cdots = \alpha_a$ cannot be tested. In Section 14.4, we discuss modifications of this hypothesis that are testable.

There are two experimental situations in which the model in (14.1) may arise. In 
the first setup, factors A and B represent two types of treatment, for example, various 
amounts of nitrogen and potassium applied in an agricultural experiment. We apply 




each of the ab combinations of the levels of A and B to n randomly selected exper- 
imental units. In the second situation, the populations exist naturally, for example, 
gender (males and females) and political preference (Democrats, Republicans, and 
Independents). A random sample of n observations is obtained from each of the ab 
populations. 

Additional assumptions that form part of the model are the following:

1. $E(\varepsilon_{ijk}) = 0$ for all $i, j, k$.

2. $\mathrm{var}(\varepsilon_{ijk}) = \sigma^2$ for all $i, j, k$.

3. $\mathrm{cov}(\varepsilon_{ijk}, \varepsilon_{rst}) = 0$ for $(i, j, k) \neq (r, s, t)$.

4. Another assumption that we sometimes add to the model is that $\varepsilon_{ijk}$ is $N(0, \sigma^2)$ for all $i, j, k$.

From assumption 1, we have $E(y_{ijk}) = \mu_{ij} = \mu + \alpha_i + \beta_j + \gamma_{ij}$, and we can rewrite the model in the form

\[ y_{ijk} = \mu_{ij} + \varepsilon_{ijk}, \tag{14.2} \]
\[ i = 1, 2, \ldots, a, \quad j = 1, 2, \ldots, b, \quad k = 1, 2, \ldots, n, \]

where $\mu_{ij} = E(y_{ijk})$ is the mean of a random observation in the $(ij)$th cell.

In the next section, we consider estimable functions of the parameters $\alpha_i$, $\beta_j$, and $\gamma_{ij}$.


14.2 ESTIMABLE FUNCTIONS 


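In the first part of this section, we use $a = 3$, $b = 2$, and $n = 2$ for expositional purposes. For this special case, the model in (14.1) becomes

\[ y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \varepsilon_{ijk}, \quad i = 1, 2, 3, \quad j = 1, 2, \quad k = 1, 2. \tag{14.3} \]

The 12 observations in (14.3) can be expressed in matrix form as

\[
\begin{pmatrix} y_{111} \\ y_{112} \\ y_{121} \\ y_{122} \\ y_{211} \\ y_{212} \\ y_{221} \\ y_{222} \\ y_{311} \\ y_{312} \\ y_{321} \\ y_{322} \end{pmatrix}
=
\left(\begin{array}{c|ccc|cc|cccccc}
1 & 1 & 0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
1 & 1 & 0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
1 & 1 & 0 & 0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 \\
1 & 1 & 0 & 0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 \\
1 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 \\
1 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
1 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
1 & 0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 1 \\
1 & 0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 1
\end{array}\right)
\begin{pmatrix} \mu \\ \alpha_1 \\ \alpha_2 \\ \alpha_3 \\ \beta_1 \\ \beta_2 \\ \gamma_{11} \\ \gamma_{12} \\ \gamma_{21} \\ \gamma_{22} \\ \gamma_{31} \\ \gamma_{32} \end{pmatrix}
+
\begin{pmatrix} \varepsilon_{111} \\ \varepsilon_{112} \\ \varepsilon_{121} \\ \varepsilon_{122} \\ \varepsilon_{211} \\ \varepsilon_{212} \\ \varepsilon_{221} \\ \varepsilon_{222} \\ \varepsilon_{311} \\ \varepsilon_{312} \\ \varepsilon_{321} \\ \varepsilon_{322} \end{pmatrix}
\tag{14.4}
\]

or

\[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \]

where $\mathbf{y}$ is $12 \times 1$, $\mathbf{X}$ is $12 \times 12$, and $\boldsymbol{\beta}$ is $12 \times 1$. (If we added another replication, so that $n = 3$, then $\mathbf{y}$ would be $18 \times 1$, $\mathbf{X}$ would be $18 \times 12$, but $\boldsymbol{\beta}$ would remain $12 \times 1$.)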
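The matrix $\mathbf{X}'\mathbf{X}$ is given by

\[
\mathbf{X}'\mathbf{X} =
\left(\begin{array}{c|ccc|cc|cccccc}
12 & 4 & 4 & 4 & 6 & 6 & 2 & 2 & 2 & 2 & 2 & 2 \\ \hline
4 & 4 & 0 & 0 & 2 & 2 & 2 & 2 & 0 & 0 & 0 & 0 \\
4 & 0 & 4 & 0 & 2 & 2 & 0 & 0 & 2 & 2 & 0 & 0 \\
4 & 0 & 0 & 4 & 2 & 2 & 0 & 0 & 0 & 0 & 2 & 2 \\ \hline
6 & 2 & 2 & 2 & 6 & 0 & 2 & 0 & 2 & 0 & 2 & 0 \\
6 & 2 & 2 & 2 & 0 & 6 & 0 & 2 & 0 & 2 & 0 & 2 \\ \hline
2 & 2 & 0 & 0 & 2 & 0 & 2 & 0 & 0 & 0 & 0 & 0 \\
2 & 2 & 0 & 0 & 0 & 2 & 0 & 2 & 0 & 0 & 0 & 0 \\
2 & 0 & 2 & 0 & 2 & 0 & 0 & 0 & 2 & 0 & 0 & 0 \\
2 & 0 & 2 & 0 & 0 & 2 & 0 & 0 & 0 & 2 & 0 & 0 \\
2 & 0 & 0 & 2 & 2 & 0 & 0 & 0 & 0 & 0 & 2 & 0 \\
2 & 0 & 0 & 2 & 0 & 2 & 0 & 0 & 0 & 0 & 0 & 2
\end{array}\right).
\tag{14.5}
\]

The partitioning in $\mathbf{X}'\mathbf{X}$ corresponds to that in $\mathbf{X}$ in (14.4), where there is a column for $\mu$, three columns for the three $\alpha$'s, two columns for the two $\beta$'s, and six columns for the six $\gamma$'s.

In both $\mathbf{X}$ and $\mathbf{X}'\mathbf{X}$, the first six columns can be obtained as linear combinations of the last six columns, which are clearly linearly independent. Hence $\mathrm{rank}(\mathbf{X}) = \mathrm{rank}(\mathbf{X}'\mathbf{X}) = 6$ [in general, $\mathrm{rank}(\mathbf{X}) = ab$].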
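
The design matrix and its rank can be reproduced with a short script. This is a minimal sketch for the special case $a = 3$, $b = 2$, $n = 2$; the column ordering $(\mu, \alpha_1, \alpha_2, \alpha_3, \beta_1, \beta_2, \gamma_{11}, \ldots, \gamma_{32})$ follows (14.4), while the variable names are chosen here for illustration.

```python
import numpy as np

a, b, n = 3, 2, 2

rows = []
for i in range(a):
    for j in range(b):
        for _ in range(n):
            mu_part = np.array([1.0])          # column for mu
            alpha_part = np.eye(a)[i]          # indicator of level i of A
            beta_part = np.eye(b)[j]           # indicator of level j of B
            gamma_part = np.eye(a * b)[i * b + j]   # indicator of cell (i, j)
            rows.append(np.concatenate([mu_part, alpha_part, beta_part, gamma_part]))

X = np.array(rows)
XtX = X.T @ X
rank_X = np.linalg.matrix_rank(X)

print(X.shape)          # (12, 12)
print(rank_X)           # 6, that is, ab
print(XtX[0])           # first row of X'X
```

The printed first row should match the first row of (14.5), and the rank equals the number of cells $ab$.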

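Since $\mathrm{rank}(\mathbf{X}) = 6$, we can find six linearly independent estimable functions of the parameters (see Theorem 12.2c). By Theorem 12.2b, we can obtain these estimable functions from $\mathbf{X}\boldsymbol{\beta}$. Using rows 1, 3, 5, 7, 9, and 11 of $E(\mathbf{y}) = \mathbf{X}\boldsymbol{\beta}$, we obtain $E(y_{ijk}) = \mu_{ij} = \mu + \alpha_i + \beta_j + \gamma_{ij}$ for $i = 1, 2, 3$ and $j = 1, 2$:

\[
\begin{aligned}
\mu_{11} &= \mu + \alpha_1 + \beta_1 + \gamma_{11}, & \mu_{12} &= \mu + \alpha_1 + \beta_2 + \gamma_{12}, \\
\mu_{21} &= \mu + \alpha_2 + \beta_1 + \gamma_{21}, & \mu_{22} &= \mu + \alpha_2 + \beta_2 + \gamma_{22}, \\
\mu_{31} &= \mu + \alpha_3 + \beta_1 + \gamma_{31}, & \mu_{32} &= \mu + \alpha_3 + \beta_2 + \gamma_{32}.
\end{aligned}
\tag{14.6}
\]

These can also be obtained from the last six rows of $\mathbf{X}'\mathbf{X}\boldsymbol{\beta}$ (see Theorem 12.2b).

By taking linear combinations of the six functions in (14.6), we obtain the following estimable functions (e.g., $\theta_1 = \mu_{11} - \mu_{21}$ and $\theta_2 = \mu_{11} - \mu_{31}$):

\[
\begin{aligned}
\mu_{11} &= \mu + \alpha_1 + \beta_1 + \gamma_{11} \\
\theta_1 &= \alpha_1 - \alpha_2 + \gamma_{11} - \gamma_{21} \quad \text{or} \quad \theta_1' = \alpha_1 - \alpha_2 + \gamma_{12} - \gamma_{22} \\
\theta_2 &= \alpha_1 - \alpha_3 + \gamma_{11} - \gamma_{31} \quad \text{or} \quad \theta_2' = \alpha_1 - \alpha_3 + \gamma_{12} - \gamma_{32} \\
\theta_3 &= \beta_1 - \beta_2 + \gamma_{11} - \gamma_{12} \quad \text{or} \quad \theta_3' = \beta_1 - \beta_2 + \gamma_{21} - \gamma_{22} \\
&\phantom{= \beta_1 - \beta_2 + \gamma_{11} - \gamma_{12}} \quad \text{or} \quad \theta_3'' = \beta_1 - \beta_2 + \gamma_{31} - \gamma_{32} \\
\theta_4 &= \gamma_{11} - \gamma_{12} - \gamma_{21} + \gamma_{22} \\
\theta_5 &= \gamma_{11} - \gamma_{12} - \gamma_{31} + \gamma_{32}.
\end{aligned}
\tag{14.7}
\]

The alternative expressions for $\theta_4$ and $\theta_5$ are of the form

\[ \gamma_{ij} - \gamma_{ij'} - \gamma_{i'j} + \gamma_{i'j'}, \quad i, i' = 1, 2, 3, \quad j, j' = 1, 2, \quad i \neq i', \quad j \neq j'. \tag{14.8} \]

[For general $a$ and $b$, we likewise obtain estimable functions of the form of (14.7) and (14.8).]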

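In $\theta_4$ and $\theta_5$ in (14.7), we see that there are estimable contrasts in the $\gamma_{ij}$'s, but in $\theta_1$, $\theta_2$, and $\theta_3$ (and in the alternative expressions $\theta_1'$, $\theta_2'$, $\theta_3'$, and $\theta_3''$) there are no estimable contrasts in the $\alpha$'s alone or $\beta$'s alone. (This is also true for the case of general $a$ and $b$.)

To obtain a single expression involving $\alpha_1 - \alpha_2$ for later use in comparing the $\alpha$ values in a hypothesis test (see Section 14.4.2b), we average $\theta_1$ and $\theta_1'$:

\[ \tfrac{1}{2}(\theta_1 + \theta_1') = \alpha_1 - \alpha_2 + \tfrac{1}{2}(\gamma_{11} + \gamma_{12}) - \tfrac{1}{2}(\gamma_{21} + \gamma_{22}) = \alpha_1 - \alpha_2 + \bar{\gamma}_{1.} - \bar{\gamma}_{2.}. \tag{14.9} \]

For $\alpha_1 - \alpha_3$, we have

\[ \tfrac{1}{2}(\theta_2 + \theta_2') = \alpha_1 - \alpha_3 + \tfrac{1}{2}(\gamma_{11} + \gamma_{12}) - \tfrac{1}{2}(\gamma_{31} + \gamma_{32}) = \alpha_1 - \alpha_3 + \bar{\gamma}_{1.} - \bar{\gamma}_{3.}. \tag{14.10} \]

Similarly, the average of $\theta_3$, $\theta_3'$, and $\theta_3''$ yields

\[ \tfrac{1}{3}(\theta_3 + \theta_3' + \theta_3'') = \beta_1 - \beta_2 + \tfrac{1}{3}(\gamma_{11} + \gamma_{21} + \gamma_{31}) - \tfrac{1}{3}(\gamma_{12} + \gamma_{22} + \gamma_{32}) = \beta_1 - \beta_2 + \bar{\gamma}_{.1} - \bar{\gamma}_{.2}. \tag{14.11} \]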


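From (14.1) and assumption 1 in Section 14.1, we have

\[ E(y_{ijk}) = E(\mu + \alpha_i + \beta_j + \gamma_{ij} + \varepsilon_{ijk}), \quad i = 1, 2, \ldots, a, \quad j = 1, 2, \ldots, b, \quad k = 1, 2, \ldots, n, \]

or

\[ \mu_{ij} = \mu + \alpha_i + \beta_j + \gamma_{ij} \tag{14.12} \]

[see also (14.2) and (14.6)]. In Section 12.1.2, we demonstrated that for a simple additive (no-interaction) model the side conditions on the $\alpha$'s and $\beta$'s led to redefined $\alpha^*$'s and $\beta^*$'s that could be expressed as deviations from means, for example, $\alpha_i^* = \bar{\mu}_{i.} - \bar{\mu}_{..}$. We now extend this formulation to an interaction model for $\mu_{ij}$:

\[ \mu_{ij} = \bar{\mu}_{..} + (\bar{\mu}_{i.} - \bar{\mu}_{..}) + (\bar{\mu}_{.j} - \bar{\mu}_{..}) + (\mu_{ij} - \bar{\mu}_{i.} - \bar{\mu}_{.j} + \bar{\mu}_{..}) = \mu^* + \alpha_i^* + \beta_j^* + \gamma_{ij}^*, \tag{14.13} \]

where

\[ \mu^* = \bar{\mu}_{..}, \quad \alpha_i^* = \bar{\mu}_{i.} - \bar{\mu}_{..}, \quad \beta_j^* = \bar{\mu}_{.j} - \bar{\mu}_{..}, \quad \gamma_{ij}^* = \mu_{ij} - \bar{\mu}_{i.} - \bar{\mu}_{.j} + \bar{\mu}_{..}. \tag{14.14} \]

With these definitions, it follows that

\[ \sum_{i=1}^{a}\alpha_i^* = 0, \qquad \sum_{j=1}^{b}\beta_j^* = 0, \]
\[ \sum_{i=1}^{a}\gamma_{ij}^* = 0 \quad \text{for all } j = 1, 2, \ldots, b, \tag{14.15} \]
\[ \sum_{j=1}^{b}\gamma_{ij}^* = 0 \quad \text{for all } i = 1, 2, \ldots, a. \]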
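
The decomposition in (14.13) and (14.14) and the side conditions in (14.15) can be checked on any table of cell means. The sketch below uses a hypothetical $3 \times 2$ table of $\mu_{ij}$ values, invented here purely for illustration.

```python
import numpy as np

# Hypothetical cell means mu_ij for a = 3, b = 2 (illustration only).
mu = np.array([[10.0, 12.0],
               [14.0, 11.0],
               [ 9.0, 16.0]])

mu_star = mu.mean()                                   # overall mean
alpha_star = mu.mean(axis=1) - mu_star                # row-mean deviations
beta_star = mu.mean(axis=0) - mu_star                 # column-mean deviations
gamma_star = (mu - mu.mean(axis=1, keepdims=True)
                 - mu.mean(axis=0, keepdims=True) + mu_star)

# Reassemble mu_ij = mu* + alpha_i* + beta_j* + gamma_ij* as in (14.13).
recon = mu_star + alpha_star[:, None] + beta_star[None, :] + gamma_star

print(np.allclose(recon, mu))
print(alpha_star.sum(), beta_star.sum())              # both zero up to rounding
print(gamma_star.sum(axis=0), gamma_star.sum(axis=1)) # zero over i and over j
```

The side conditions hold automatically for any choice of cell means, since they follow from the definitions rather than from the data.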


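Using (14.12), we can write $\alpha_i^*$, $\beta_j^*$, and $\gamma_{ij}^*$ in (14.14) in terms of the original parameters; for example, $\alpha_i^*$ becomes

\[
\begin{aligned}
\alpha_i^* &= \bar{\mu}_{i.} - \bar{\mu}_{..} = \frac{1}{b}\sum_{j}\mu_{ij} - \frac{1}{ab}\sum_{ij}\mu_{ij} \\
&= \frac{1}{b}\sum_{j}(\mu + \alpha_i + \beta_j + \gamma_{ij}) - \frac{1}{ab}\sum_{ij}(\mu + \alpha_i + \beta_j + \gamma_{ij}) \\
&= \frac{1}{b}(b\mu + b\alpha_i + \beta_{.} + \gamma_{i.}) - \frac{1}{ab}(ab\mu + b\alpha_{.} + a\beta_{.} + \gamma_{..}) \\
&= \mu + \alpha_i + \bar{\beta}_{.} + \bar{\gamma}_{i.} - \mu - \bar{\alpha}_{.} - \bar{\beta}_{.} - \bar{\gamma}_{..} \\
&= \alpha_i - \bar{\alpha}_{.} + \bar{\gamma}_{i.} - \bar{\gamma}_{..}.
\end{aligned}
\tag{14.16}
\]

Similarly

\[ \beta_j^* = \beta_j - \bar{\beta}_{.} + \bar{\gamma}_{.j} - \bar{\gamma}_{..}, \tag{14.17} \]

\[ \gamma_{ij}^* = \gamma_{ij} - \bar{\gamma}_{i.} - \bar{\gamma}_{.j} + \bar{\gamma}_{..}. \tag{14.18} \]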


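14.3 ESTIMATORS OF $\boldsymbol{\lambda}'\boldsymbol{\beta}$ AND $\sigma^2$


We consider estimation of estimable functions $\boldsymbol{\lambda}'\boldsymbol{\beta}$ in Section 14.3.1 and estimation of $\sigma^2$ in Section 14.3.2.


14.3.1 Solving the Normal Equations and Estimating $\boldsymbol{\lambda}'\boldsymbol{\beta}$


We discuss two approaches for solving the normal equations $\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}'\mathbf{y}$ and for obtaining estimates of an estimable function $\boldsymbol{\lambda}'\boldsymbol{\beta}$.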


14.3.1.1 Side Conditions 
From X and y in (14.4), we obtain X’y for the special case a = 3, b = 2, and n = 2: 


XY = 1 Vhs V9 V3.2 Voce V2.9 V1» Y12.5 Y21-9 Y22.5 Y31.9 32.) « (14.19) 


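On the basis of $\mathbf{X}'\mathbf{y}$ in (14.19) and $\mathbf{X}'\mathbf{X}$ in (14.5), we write the normal equations
$\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}'\mathbf{y}$ in terms of general $a$, $b$, and $n$:

$$\begin{aligned}
abn\hat\mu + bn\sum_{i=1}^{a} \hat\alpha_i + an\sum_{j=1}^{b} \hat\beta_j + n\sum_{i=1}^{a}\sum_{j=1}^{b} \hat\gamma_{ij} &= y_{...}, \\
bn\hat\mu + bn\hat\alpha_i + n\sum_{j=1}^{b} \hat\beta_j + n\sum_{j=1}^{b} \hat\gamma_{ij} &= y_{i..}, \quad i = 1, 2, \ldots, a, \\
an\hat\mu + n\sum_{i=1}^{a} \hat\alpha_i + an\hat\beta_j + n\sum_{i=1}^{a} \hat\gamma_{ij} &= y_{.j.}, \quad j = 1, 2, \ldots, b, \\
n\hat\mu + n\hat\alpha_i + n\hat\beta_j + n\hat\gamma_{ij} &= y_{ij.}, \quad i = 1, 2, \ldots, a, \; j = 1, 2, \ldots, b.
\end{aligned} \tag{14.20}$$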


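With the side conditions $\sum_i \hat\alpha_i = 0$, $\sum_j \hat\beta_j = 0$, $\sum_i \hat\gamma_{ij} = 0$, and $\sum_j \hat\gamma_{ij} = 0$, the
solution of the normal equations in (14.20) is given by

$$\begin{aligned}
\hat\mu &= \frac{y_{...}}{abn} = \bar y_{...}, \\
\hat\alpha_i &= \frac{y_{i..}}{bn} - \hat\mu = \bar y_{i..} - \bar y_{...}, \\
\hat\beta_j &= \frac{y_{.j.}}{an} - \hat\mu = \bar y_{.j.} - \bar y_{...}, \\
\hat\gamma_{ij} &= \frac{y_{ij.}}{n} - \frac{y_{i..}}{bn} - \frac{y_{.j.}}{an} + \frac{y_{...}}{abn} = \bar y_{ij.} - \bar y_{i..} - \bar y_{.j.} + \bar y_{...}.
\end{aligned} \tag{14.21}$$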
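As a numerical check of (14.21), here is a sketch on simulated data (not from the text) with $a = 3$, $b = 2$, $n = 2$. It computes the side-condition estimates from the sample means and confirms that they satisfy the normal equations $\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}'\mathbf{y}$. The helper `design_matrix` and the column order $(\mu, \alpha, \beta, \gamma)$ are assumptions of this sketch.

```python
import numpy as np

def design_matrix(a, b, n):
    """Two-way design matrix with columns (mu, alpha's, beta's, gamma's); hypothetical helper."""
    X = np.zeros((a * b * n, 1 + a + b + a * b))
    row = 0
    for i in range(a):
        for j in range(b):
            for _ in range(n):
                X[row, [0, 1 + i, 1 + a + j, 1 + a + b + i * b + j]] = 1.0
                row += 1
    return X

a, b, n = 3, 2, 2
rng = np.random.default_rng(1)
y = rng.normal(loc=10.0, scale=1.0, size=(a, b, n))  # simulated responses y[i, j, k]

grand = y.mean()
row_means = y.mean(axis=(1, 2))   # ybar_i..
col_means = y.mean(axis=(0, 2))   # ybar_.j.
cell_means = y.mean(axis=2)       # ybar_ij.

# Side-condition solution (14.21)
mu_hat = grand
alpha_hat = row_means - grand
beta_hat = col_means - grand
gamma_hat = cell_means - row_means[:, None] - col_means[None, :] + grand

beta_vec = np.concatenate([[mu_hat], alpha_hat, beta_hat, gamma_hat.ravel()])

X = design_matrix(a, b, n)
lhs = X.T @ X @ beta_vec       # left side of the normal equations
rhs = X.T @ y.ravel()          # X'y, the vector of totals in (14.19)
```

Because `y.ravel()` runs over $i$, then $j$, then $k$, it matches the row order used in `design_matrix`.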


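These are unbiased estimators of the parameters $\mu^*$, $\alpha_i^*$, $\beta_j^*$, $\gamma_{ij}^*$ in (14.14), subject to
the side conditions in (14.15). If side conditions are not imposed on the parameters,
then the estimators in (14.21) are not unbiased estimators of individual parameters,
but these estimators can still be used in estimable functions. For example, consider
the estimable function $\boldsymbol{\lambda}'\boldsymbol{\beta}$ in (14.9) (for $a = 3$, $b = 2$):

$$\boldsymbol{\lambda}'\boldsymbol{\beta} = \alpha_1 - \alpha_2 + \tfrac12(\gamma_{11} + \gamma_{12}) - \tfrac12(\gamma_{21} + \gamma_{22}).$$

By Theorem 12.3a and (14.21), the estimator is given by

$$\begin{aligned}
\boldsymbol{\lambda}'\hat{\boldsymbol{\beta}} &= \hat\alpha_1 - \hat\alpha_2 + \tfrac12(\hat\gamma_{11} + \hat\gamma_{12}) - \tfrac12(\hat\gamma_{21} + \hat\gamma_{22}) \\
&= \bar y_{1..} - \bar y_{...} - (\bar y_{2..} - \bar y_{...}) + \tfrac12(\bar y_{11.} - \bar y_{1..} - \bar y_{.1.} + \bar y_{...}) \\
&\quad + \tfrac12(\bar y_{12.} - \bar y_{1..} - \bar y_{.2.} + \bar y_{...}) - \tfrac12(\bar y_{21.} - \bar y_{2..} - \bar y_{.1.} + \bar y_{...}) \\
&\quad - \tfrac12(\bar y_{22.} - \bar y_{2..} - \bar y_{.2.} + \bar y_{...}).
\end{aligned}$$

Since $\bar y_{11.} + \bar y_{12.} = 2\bar y_{1..}$ and $\bar y_{21.} + \bar y_{22.} = 2\bar y_{2..}$, the estimator reduces to

$$\boldsymbol{\lambda}'\hat{\boldsymbol{\beta}} = \hat\alpha_1 - \hat\alpha_2 + \tfrac12(\hat\gamma_{11} + \hat\gamma_{12}) - \tfrac12(\hat\gamma_{21} + \hat\gamma_{22}) = \bar y_{1..} - \bar y_{2..}. \tag{14.22}$$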


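This estimator of $\alpha_1 - \alpha_2 + \tfrac12(\gamma_{11} + \gamma_{12}) - \tfrac12(\gamma_{21} + \gamma_{22})$ is the same as the estimator
we would have for $\alpha_1^* - \alpha_2^*$, using $\hat\alpha_1$ and $\hat\alpha_2$ as estimators of $\alpha_1^*$ and $\alpha_2^*$:

$$\widehat{\alpha_1^* - \alpha_2^*} = \hat\alpha_1 - \hat\alpha_2 = \bar y_{1..} - \bar y_{...} - (\bar y_{2..} - \bar y_{...}) = \bar y_{1..} - \bar y_{2..}.$$

By Theorem 12.3d, such estimators are BLUE. If we also assume that $\varepsilon_{ijk}$ is
$N(0, \sigma^2)$, then by Theorem 12.3h, the estimators are minimum variance unbiased
estimators.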


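14.3.1.2 Generalized Inverse
By Corollary 1 to Theorem 2.8b, a generalized inverse of $\mathbf{X}'\mathbf{X}$ in (14.5) is given by

$$(\mathbf{X}'\mathbf{X})^{-} = \frac12\begin{pmatrix} \mathbf{O} & \mathbf{O} \\ \mathbf{O} & \mathbf{I} \end{pmatrix}, \tag{14.23}$$

where the $\mathbf{O}$s are $6 \times 6$. Then by (12.13) and (14.19), a solution to the normal
equations for $a = 3$ and $b = 2$ is given by

$$\begin{aligned}
\hat{\boldsymbol{\beta}} &= (\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{y} \\
&= (0, 0, 0, 0, 0, 0, \bar y_{11.}, \bar y_{12.}, \bar y_{21.}, \bar y_{22.}, \bar y_{31.}, \bar y_{32.})'.
\end{aligned} \tag{14.24}$$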


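The estimators in (14.24) are different from those in (14.21), but they give the same
estimators of estimable functions. For example, for $\boldsymbol{\lambda}'\boldsymbol{\beta} = \alpha_1 - \alpha_2 + \tfrac12(\gamma_{11} + \gamma_{12}) - \tfrac12(\gamma_{21} + \gamma_{22})$ in (14.9), we have

$$\begin{aligned}
\boldsymbol{\lambda}'\hat{\boldsymbol{\beta}} &= \hat\alpha_1 - \hat\alpha_2 + \tfrac12[\hat\gamma_{11} + \hat\gamma_{12} - (\hat\gamma_{21} + \hat\gamma_{22})] \\
&= 0 - 0 + \tfrac12[\bar y_{11.} + \bar y_{12.} - (\bar y_{21.} + \bar y_{22.})].
\end{aligned}$$

It was noted preceding (14.22) that $\bar y_{11.} + \bar y_{12.} = 2\bar y_{1..}$ and $\bar y_{21.} + \bar y_{22.} = 2\bar y_{2..}$. Thus $\boldsymbol{\lambda}'\hat{\boldsymbol{\beta}}$
becomes

$$\boldsymbol{\lambda}'\hat{\boldsymbol{\beta}} = \tfrac12(2\bar y_{1..} - 2\bar y_{2..}) = \bar y_{1..} - \bar y_{2..},$$

which is the same estimator as that obtained in (14.22) using $\hat{\boldsymbol{\beta}}$ in (14.21).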
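The invariance of estimable functions to the choice of solution can be seen numerically. The sketch below, on simulated data with $a = 3$, $b = 2$, $n = 2$, builds the generalized inverse of (14.23), checks the defining property $\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{X} = \mathbf{X}'\mathbf{X}$, and compares the resulting solution with the one from the Moore-Penrose inverse (`numpy.linalg.pinv`), which is a different generalized inverse chosen here purely for contrast. The helper `design_matrix` is hypothetical.

```python
import numpy as np

def design_matrix(a, b, n):
    """Two-way design matrix with columns (mu, alpha's, beta's, gamma's); hypothetical helper."""
    X = np.zeros((a * b * n, 1 + a + b + a * b))
    row = 0
    for i in range(a):
        for j in range(b):
            for _ in range(n):
                X[row, [0, 1 + i, 1 + a + j, 1 + a + b + i * b + j]] = 1.0
                row += 1
    return X

a, b, n = 3, 2, 2
rng = np.random.default_rng(2)
y = rng.normal(loc=10.0, scale=1.0, size=a * b * n)  # simulated responses
X = design_matrix(a, b, n)
XtX = X.T @ X

# (14.23): zero blocks for (mu, alpha, beta) and (1/n) I for the gamma block
G = np.zeros((12, 12))
G[6:, 6:] = np.eye(6) / n
is_g_inverse = np.allclose(XtX @ G @ XtX, XtX)

beta_g = G @ X.T @ y                       # (14.24): six zeros, then the cell means
beta_mp = np.linalg.pinv(XtX) @ X.T @ y    # another solution of the normal equations

# Estimable function (14.9)
lam = np.array([0, 1, -1, 0, 0, 0, 0.5, 0.5, -0.5, -0.5, 0, 0])
est_g = lam @ beta_g
est_mp = lam @ beta_mp

cell_means = y.reshape(a, b, n).mean(axis=2)
row_means = y.reshape(a, b, n).mean(axis=(1, 2))
```

The two solution vectors differ, but `est_g` and `est_mp` both equal $\bar y_{1..} - \bar y_{2..}$ up to rounding.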


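14.3.2 An Estimator for $\sigma^2$


For the two-way model in (14.1), assumption 2 states that $\mathrm{var}(\varepsilon_{ijk}) = \sigma^2$ for all $i, j, k$.
To estimate $\sigma^2$, we use (12.22), $s^2 = \mathrm{SSE}/[ab(n - 1)]$, where $abn$ is the number of
rows of $\mathbf{X}$ and $ab$ is the rank of $\mathbf{X}$. By (12.20) and (12.21), we have

$$\mathrm{SSE} = \mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} = \mathbf{y}'[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}']\mathbf{y}.$$

With $\hat{\boldsymbol{\beta}}$ from (14.24) and $\mathbf{X}'\mathbf{y}$ from (14.19), SSE can be written as

$$\begin{aligned}
\mathrm{SSE} &= \mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} \\
&= \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{n} y_{ijk}^2 - \sum_{i=1}^{a}\sum_{j=1}^{b} \bar y_{ij.}\, y_{ij.} \\
&= \sum_{ijk} y_{ijk}^2 - n\sum_{ij} \bar y_{ij.}^2.
\end{aligned} \tag{14.25}$$

It can also be shown (see Problem 14.10) that this is equal to

$$\mathrm{SSE} = \sum_{ijk} (y_{ijk} - \bar y_{ij.})^2. \tag{14.26}$$

Thus, $s^2$ is given by either of the two forms

$$s^2 = \frac{\sum_{ijk} (y_{ijk} - \bar y_{ij.})^2}{ab(n - 1)} \tag{14.27}$$

$$\phantom{s^2} = \frac{\sum_{ijk} y_{ijk}^2 - n\sum_{ij} \bar y_{ij.}^2}{ab(n - 1)}. \tag{14.28}$$

By Theorem 12.3e, $E(s^2) = \sigma^2$.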
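The equivalence of (14.25) and (14.26) is easy to confirm numerically. The sketch below uses simulated data (not from the text) stored as an array `y[i, j, k]`; it also evaluates the quadratic form $\mathbf{y}'[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}']\mathbf{y}$, using the fact that in the balanced case $\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'$ simply averages the observations within each cell.

```python
import numpy as np

a, b, n = 3, 2, 2
rng = np.random.default_rng(3)
y = rng.normal(loc=10.0, scale=1.0, size=(a, b, n))  # simulated y[i, j, k]

cell_means = y.mean(axis=2)  # ybar_ij.

# (14.25): uncorrected sum of squares minus n * sum of squared cell means
sse_totals = (y ** 2).sum() - n * (cell_means ** 2).sum()

# (14.26): squared deviations of each observation from its own cell mean
sse_dev = ((y - cell_means[:, :, None]) ** 2).sum()

# Quadratic form y'[I - P]y, where P averages within cells (block-diagonal J/n)
P = np.kron(np.eye(a * b), np.ones((n, n)) / n)
y_flat = y.ravel()
sse_quad = y_flat @ (np.eye(a * b * n) - P) @ y_flat

# (14.27): the estimator of sigma^2 on ab(n - 1) degrees of freedom
s2 = sse_dev / (a * b * (n - 1))
```

The deviation form (14.26) is usually the safer one to compute, because (14.25) subtracts two large, nearly equal numbers.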


14.4 TESTING HYPOTHESES 


In this section, we consider tests of hypotheses for the main effects A and B and for
the interaction AB. Throughout this section, we assume that $\mathbf{y}$ is $N_{abn}(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{I})$. For
expositional convenience, we sometimes illustrate with $a = 3$ and $b = 2$.


14.4.1 Test for Interaction 


In Section 14.4.1.1, we express the interaction hypothesis in terms of estimable
parameters, and in Sections 14.4.1.2 and 14.4.1.3, we discuss two approaches to the
full-reduced-model test.


14.4.1.1 The Interaction Hypothesis
By (14.8), estimable contrasts in the $\gamma_{ij}$'s have the form

$$\gamma_{ij} - \gamma_{ij'} - \gamma_{i'j} + \gamma_{i'j'}, \quad i \neq i', \; j \neq j'. \tag{14.29}$$

We now show that the interaction hypothesis can be expressed in terms of these
estimable functions.

For the illustrative model in (14.3) with $a = 3$ and $b = 2$, the cell means in (14.12)
are given in Figure 14.1. The B effect at the first level of A is $\mu_{11} - \mu_{12}$, the B effect at
the second level of A is $\mu_{21} - \mu_{22}$, and the B effect at the third level of A is $\mu_{31} - \mu_{32}$.


$$\begin{array}{c|cc}
A \backslash B & 1 & 2 \\ \hline
1 & \mu_{11} & \mu_{12} \\
2 & \mu_{21} & \mu_{22} \\
3 & \mu_{31} & \mu_{32}
\end{array}$$

Figure 14.1. Cell means for the model in (14.2) and (14.12).


If these three B effects are equal, we have no interaction. If at least one effect differs
from the other two, we have an interaction. The hypothesis of no interaction can
therefore be expressed as

$$H_0\colon \mu_{11} - \mu_{12} = \mu_{21} - \mu_{22} = \mu_{31} - \mu_{32}. \tag{14.30}$$


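To show that this hypothesis is testable, we first write the three differences in terms
of the $\gamma_{ij}$'s by using (14.12). For the first two differences in (14.30), we obtain

$$\begin{aligned}
\mu_{11} - \mu_{12} &= \mu + \alpha_1 + \beta_1 + \gamma_{11} - (\mu + \alpha_1 + \beta_2 + \gamma_{12}) \\
&= \beta_1 - \beta_2 + \gamma_{11} - \gamma_{12}, \\
\mu_{21} - \mu_{22} &= \mu + \alpha_2 + \beta_1 + \gamma_{21} - (\mu + \alpha_2 + \beta_2 + \gamma_{22}) \\
&= \beta_1 - \beta_2 + \gamma_{21} - \gamma_{22}.
\end{aligned}$$

Then the equality $\mu_{11} - \mu_{12} = \mu_{21} - \mu_{22}$ in (14.30) becomes

$$\beta_1 - \beta_2 + \gamma_{11} - \gamma_{12} = \beta_1 - \beta_2 + \gamma_{21} - \gamma_{22}$$

or

$$\gamma_{11} - \gamma_{12} - \gamma_{21} + \gamma_{22} = 0. \tag{14.31}$$

The function $\gamma_{11} - \gamma_{12} - \gamma_{21} + \gamma_{22}$ on the left side of (14.31) is an estimable contrast
[see (14.29)]. Similarly, the third difference in (14.30) becomes

$$\mu_{31} - \mu_{32} = \beta_1 - \beta_2 + \gamma_{31} - \gamma_{32},$$

and when this is set equal to $\mu_{21} - \mu_{22} = \beta_1 - \beta_2 + \gamma_{21} - \gamma_{22}$, we obtain

$$\gamma_{21} - \gamma_{22} - \gamma_{31} + \gamma_{32} = 0. \tag{14.32}$$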


By (14.29), the function $\gamma_{21} - \gamma_{22} - \gamma_{31} + \gamma_{32}$ on the left side of (14.32) is
estimable. Thus the two expressions in (14.31) and (14.32) are equivalent to the
interaction hypothesis in (14.30), and the hypothesis is therefore testable.

Since the interaction hypothesis can be expressed in terms of estimable functions
of $\gamma_{ij}$'s that do not involve $\alpha_i$'s or $\beta_j$'s, we can proceed with a full-reduced-model
approach. On the other hand, by (14.7), the $\alpha$'s and $\beta$'s are not estimable without
the $\gamma$'s. We therefore have to redefine the main effects in order to get a test in the
presence of interaction; see Section 14.4.2.

To get a reduced model from (14.1) or (14.3), we work with $\gamma_{ij}^* = \mu_{ij} - \bar\mu_{i.} - \bar\mu_{.j} + \bar\mu_{..}$
in (14.14), which is estimable [it can be estimated unbiasedly
by $\hat\gamma_{ij} = \bar y_{ij.} - \bar y_{i..} - \bar y_{.j.} + \bar y_{...}$ in (14.21)]. Using (14.13), the model can be expressed
in terms of parameters subject to the side conditions in (14.15):

$$y_{ijk} = \mu^* + \alpha_i^* + \beta_j^* + \gamma_{ij}^* + \varepsilon_{ijk}. \tag{14.33}$$

We can get a reduced model from (14.33) by setting $\gamma_{ij}^* = 0$.


In the following theorem, we show that $H_0\colon \gamma_{ij}^* = 0$ for all $i, j$ is equivalent to the
interaction hypothesis expressed as (14.30) or as (14.31) and (14.32). Since all three
of these expressions involve $a = 3$ and $b = 2$, we continue with this illustrative
special case.


Theorem 14.4a. Consider the model (14.33) for $a = 3$ and $b = 2$. The
hypothesis $H_0\colon \gamma_{ij}^* = 0$, $i = 1, 2, 3$, $j = 1, 2$, is equivalent to (14.30),

$$H_0\colon \mu_{11} - \mu_{12} = \mu_{21} - \mu_{22} = \mu_{31} - \mu_{32}, \tag{14.34}$$

and to the equivalent form

$$H_0\colon \begin{pmatrix} \gamma_{11} - \gamma_{12} - \gamma_{21} + \gamma_{22} \\ \gamma_{21} - \gamma_{22} - \gamma_{31} + \gamma_{32} \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix} \tag{14.35}$$

obtained from (14.31) and (14.32).


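PROOF. To establish the equivalence of $\gamma_{ij}^* = 0$ and the first equality in (14.35), we
find an expression for each $\gamma_{ij}$ by setting $\gamma_{ij}^* = 0$. For $\gamma_{12}$ and $\gamma_{12}^*$, for example, we
use (14.18) to obtain

$$\gamma_{12}^* = \gamma_{12} - \bar\gamma_{1.} - \bar\gamma_{.2} + \bar\gamma_{..}. \tag{14.36}$$

Then $\gamma_{12}^* = 0$ gives

$$\gamma_{12} = \bar\gamma_{1.} + \bar\gamma_{.2} - \bar\gamma_{..}.$$

Similarly, from (14.18) and the equalities $\gamma_{11}^* = 0$, $\gamma_{21}^* = 0$, and $\gamma_{22}^* = 0$, we obtain

$$\gamma_{11} = \bar\gamma_{1.} + \bar\gamma_{.1} - \bar\gamma_{..}, \quad \gamma_{21} = \bar\gamma_{2.} + \bar\gamma_{.1} - \bar\gamma_{..}, \quad \gamma_{22} = \bar\gamma_{2.} + \bar\gamma_{.2} - \bar\gamma_{..}.$$

When these are substituted into $\gamma_{11} - \gamma_{12} - \gamma_{21} + \gamma_{22}$, we obtain

$$\begin{aligned}
\gamma_{11} - \gamma_{12} - \gamma_{21} + \gamma_{22} &= \bar\gamma_{1.} + \bar\gamma_{.1} - \bar\gamma_{..} - (\bar\gamma_{1.} + \bar\gamma_{.2} - \bar\gamma_{..}) \\
&\quad - (\bar\gamma_{2.} + \bar\gamma_{.1} - \bar\gamma_{..}) + \bar\gamma_{2.} + \bar\gamma_{.2} - \bar\gamma_{..} \\
&= 0,
\end{aligned}$$

which is the first equality in (14.35). The second equality in (14.35) is obtained
similarly.

To show that the first equality in (14.34) is equivalent to the first equality in
(14.35), we substitute $\mu_{ij} = \mu + \alpha_i + \beta_j + \gamma_{ij}$ into $\mu_{11} - \mu_{12} = \mu_{21} - \mu_{22}$:

$$\begin{aligned}
0 &= \mu_{11} - \mu_{12} - \mu_{21} + \mu_{22} \\
&= \mu + \alpha_1 + \beta_1 + \gamma_{11} - (\mu + \alpha_1 + \beta_2 + \gamma_{12}) \\
&\quad - (\mu + \alpha_2 + \beta_1 + \gamma_{21}) + \mu + \alpha_2 + \beta_2 + \gamma_{22} \\
&= \gamma_{11} - \gamma_{12} - \gamma_{21} + \gamma_{22}.
\end{aligned}$$

Similarly, the second equality in (14.34) is equivalent to the second equality in
(14.35).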


In Section 14.4.1.2, we obtain the test for interaction based on the normal 
equations, and in Section 14.4.1.3, we give the test based on a generalized inverse. 


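14.4.1.2 Full-Reduced-Model Test Based on the Normal Equations

In this section, we develop the full-reduced-model test for interaction using the
normal equations. We express the full model in terms of parameters subject to side
conditions, as in (14.33),

$$y_{ijk} = \mu^* + \alpha_i^* + \beta_j^* + \gamma_{ij}^* + \varepsilon_{ijk}, \tag{14.37}$$

where $\mu^* = \bar\mu_{..}$, $\alpha_i^* = \bar\mu_{i.} - \bar\mu_{..}$, $\beta_j^* = \bar\mu_{.j} - \bar\mu_{..}$, and $\gamma_{ij}^* = \mu_{ij} - \bar\mu_{i.} - \bar\mu_{.j} + \bar\mu_{..}$ are as
given in (14.14). The reduced model under $H_0\colon \gamma_{ij}^* = 0$ for all $i$ and $j$ is

$$y_{ijk} = \mu^* + \alpha_i^* + \beta_j^* + \varepsilon_{ijk}. \tag{14.38}$$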


Since we are considering a balanced model, the parameters $\mu^*$, $\alpha_i^*$, and $\beta_j^*$ (subject
to side conditions) in the reduced model (14.38) are the same as those in the full
model (14.37) [in (14.44), the estimates in the two models are also shown to be the
same].


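Using the notation of Chapter 13, the sum of squares for testing $H_0\colon \gamma_{ij}^* = 0$ is
given by

$$\mathrm{SS}(\gamma|\mu, \alpha, \beta) = \mathrm{SS}(\mu, \alpha, \beta, \gamma) - \mathrm{SS}(\mu, \alpha, \beta). \tag{14.39}$$

The estimators $\hat\mu$, $\hat\alpha_i$, $\hat\beta_j$, $\hat\gamma_{ij}$ in (14.21) are unbiased estimators of $\mu^*$, $\alpha_i^*$, $\beta_j^*$, $\gamma_{ij}^*$.
Extending $\mathbf{X}'\mathbf{y}$ in (14.19) from $a = 3$ and $b = 2$ to general $a$ and $b$, we obtain

$$\begin{aligned}
\mathrm{SS}(\mu, \alpha, \beta, \gamma) &= \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} \\
&= \hat\mu y_{...} + \sum_{i=1}^{a} \hat\alpha_i y_{i..} + \sum_{j=1}^{b} \hat\beta_j y_{.j.} + \sum_{i=1}^{a}\sum_{j=1}^{b} \hat\gamma_{ij} y_{ij.} \\
&= \bar y_{...} y_{...} + \sum_i (\bar y_{i..} - \bar y_{...}) y_{i..} + \sum_j (\bar y_{.j.} - \bar y_{...}) y_{.j.} \\
&\quad + \sum_{ij} (\bar y_{ij.} - \bar y_{i..} - \bar y_{.j.} + \bar y_{...}) y_{ij.} \\
&= \frac{y_{...}^2}{abn} + \left(\sum_i \frac{y_{i..}^2}{bn} - \frac{y_{...}^2}{abn}\right) + \left(\sum_j \frac{y_{.j.}^2}{an} - \frac{y_{...}^2}{abn}\right) \\
&\quad + \left(\sum_{ij} \frac{y_{ij.}^2}{n} - \sum_i \frac{y_{i..}^2}{bn} - \sum_j \frac{y_{.j.}^2}{an} + \frac{y_{...}^2}{abn}\right)
\end{aligned} \tag{14.40}$$

$$\phantom{\mathrm{SS}(\mu, \alpha, \beta, \gamma)} = \sum_{ij} \frac{y_{ij.}^2}{n}. \tag{14.41}$$

Note that we would obtain the same result using $\hat{\boldsymbol{\beta}}$ in (14.24) (extended to general $a$
and $b$).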

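For the reduced model in (14.38), the $\mathbf{X}_1$ matrix and $\mathbf{X}_1'\mathbf{y}$ vector for $a = 3$ and $b = 2$
consist of the first six columns of $\mathbf{X}$ in (14.4) and the first six elements of $\mathbf{X}'\mathbf{y}$ in
(14.19). We thus obtain

$$\mathbf{X}_1'\mathbf{X}_1 = \begin{pmatrix}
12 & 4 & 4 & 4 & 6 & 6 \\
4 & 4 & 0 & 0 & 2 & 2 \\
4 & 0 & 4 & 0 & 2 & 2 \\
4 & 0 & 0 & 4 & 2 & 2 \\
6 & 2 & 2 & 2 & 6 & 0 \\
6 & 2 & 2 & 2 & 0 & 6
\end{pmatrix}, \qquad
\mathbf{X}_1'\mathbf{y} = \begin{pmatrix} y_{...} \\ y_{1..} \\ y_{2..} \\ y_{3..} \\ y_{.1.} \\ y_{.2.} \end{pmatrix}. \tag{14.42}$$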


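From the pattern in (14.42), we see that for general $a$ and $b$ the normal equations for
the reduced model become

$$\begin{aligned}
abn\hat\mu + bn\sum_{i=1}^{a} \hat\alpha_i + an\sum_{j=1}^{b} \hat\beta_j &= y_{...}, \\
bn\hat\mu + bn\hat\alpha_i + n\sum_{j=1}^{b} \hat\beta_j &= y_{i..}, \quad i = 1, 2, \ldots, a, \\
an\hat\mu + n\sum_{i=1}^{a} \hat\alpha_i + an\hat\beta_j &= y_{.j.}, \quad j = 1, 2, \ldots, b.
\end{aligned} \tag{14.43}$$

Using the side conditions $\sum_i \hat\alpha_i = 0$ and $\sum_j \hat\beta_j = 0$, we obtain the solutions

$$\hat\mu = \frac{y_{...}}{abn} = \bar y_{...}, \qquad \hat\alpha_i = \frac{y_{i..}}{bn} - \hat\mu = \bar y_{i..} - \bar y_{...}, \qquad \hat\beta_j = \frac{y_{.j.}}{an} - \hat\mu = \bar y_{.j.} - \bar y_{...}. \tag{14.44}$$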

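These solutions are the same as those for the full model in (14.21), as expected in the
case of a balanced model.

The sum of squares for the reduced model is therefore

$$\begin{aligned}
\mathrm{SS}(\mu, \alpha, \beta) &= \hat{\boldsymbol{\beta}}_1'\mathbf{X}_1'\mathbf{y} \\
&= \frac{y_{...}^2}{abn} + \left(\sum_i \frac{y_{i..}^2}{bn} - \frac{y_{...}^2}{abn}\right) + \left(\sum_j \frac{y_{.j.}^2}{an} - \frac{y_{...}^2}{abn}\right),
\end{aligned}$$

and the difference in (14.39) is

$$\begin{aligned}
\mathrm{SS}(\gamma|\mu, \alpha, \beta) &= \mathrm{SS}(\mu, \alpha, \beta, \gamma) - \mathrm{SS}(\mu, \alpha, \beta) \\
&= \sum_{ij} \frac{y_{ij.}^2}{n} - \sum_i \frac{y_{i..}^2}{bn} - \sum_j \frac{y_{.j.}^2}{an} + \frac{y_{...}^2}{abn}.
\end{aligned} \tag{14.45}$$

The error sum of squares is given by

$$\mathrm{SSE} = \mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} = \sum_{ijk} y_{ijk}^2 - \sum_{ij} \frac{y_{ij.}^2}{n} \tag{14.46}$$

(see Problem 14.13b). In terms of means rather than totals, (14.45) and (14.46)
become

$$\mathrm{SS}(\gamma|\mu, \alpha, \beta) = n\sum_{ij} (\bar y_{ij.} - \bar y_{i..} - \bar y_{.j.} + \bar y_{...})^2, \tag{14.47}$$

$$\mathrm{SSE} = \sum_{ijk} (y_{ijk} - \bar y_{ij.})^2. \tag{14.48}$$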


There are $ab$ parameters involved in the hypothesis $H_0\colon \gamma_{ij}^* = 0$, $i = 1, 2, \ldots, a$,
$j = 1, 2, \ldots, b$. However, the $a + b$ side conditions $\sum_i \gamma_{ij}^* = 0$ for $j = 1, 2, \ldots, b$
and $\sum_j \gamma_{ij}^* = 0$ for $i = 1, 2, \ldots, a$ impose $a - 1 + b - 1$ restrictions. With the
additional condition $\sum_{i=1}^{a}\sum_{j=1}^{b} \gamma_{ij}^* = 0$, we have a total of $a + b - 2 + 1 = a + b - 1$
restrictions. Therefore the degrees of freedom for $\mathrm{SS}(\gamma|\mu, \alpha, \beta)$ are
$ab - (a + b - 1) = (a - 1)(b - 1)$ (see Problem 14.14).

To test $H_0\colon \gamma_{ij}^* = 0$ for all $i, j$, we therefore use the test statistic

$$F = \frac{\mathrm{SS}(\gamma|\mu, \alpha, \beta)/[(a - 1)(b - 1)]}{\mathrm{SSE}/[ab(n - 1)]}, \tag{14.49}$$

which is distributed as $F[(a - 1)(b - 1), ab(n - 1)]$ if $H_0$ is true (see Section 12.7.2).
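The interaction test can be sketched in a few lines. The function below (the name `interaction_test` and the array layout `y[i, j, k]` are our own choices) computes the interaction sum of squares from means as in (14.47), the error sum of squares as in (14.48), and the $F$ statistic with its degrees of freedom as in (14.49). It is then applied to simulated data, and the totals form (14.45) is computed alongside for comparison.

```python
import numpy as np

def interaction_test(y):
    """Return (SS_interaction, SSE, F, df1, df2) for balanced data y[i, j, k]."""
    a, b, n = y.shape
    cell = y.mean(axis=2)        # ybar_ij.
    row = y.mean(axis=(1, 2))    # ybar_i..
    col = y.mean(axis=(0, 2))    # ybar_.j.
    grand = y.mean()             # ybar_...
    ss_int = n * ((cell - row[:, None] - col[None, :] + grand) ** 2).sum()  # (14.47)
    sse = ((y - cell[:, :, None]) ** 2).sum()                               # (14.48)
    df1, df2 = (a - 1) * (b - 1), a * b * (n - 1)
    F = (ss_int / df1) / (sse / df2)                                        # (14.49)
    return ss_int, sse, F, df1, df2

a, b, n = 3, 2, 2
rng = np.random.default_rng(4)
y = rng.normal(loc=10.0, scale=1.0, size=(a, b, n))  # simulated data, not from the text

ss_int, sse, F, df1, df2 = interaction_test(y)

# Totals form (14.45) for comparison
cell_tot = y.sum(axis=2)
row_tot = y.sum(axis=(1, 2))
col_tot = y.sum(axis=(0, 2))
grand_tot = y.sum()
ss_int_totals = ((cell_tot ** 2).sum() / n - (row_tot ** 2).sum() / (b * n)
                 - (col_tot ** 2).sum() / (a * n) + grand_tot ** 2 / (a * b * n))
```

The statistic `F` would be compared with an upper percentage point of the $F[(a-1)(b-1), ab(n-1)]$ distribution; no critical value is computed here.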


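14.4.1.3 Full-Reduced-Model Test Based on a Generalized Inverse

We now consider a matrix development of SSE and $\mathrm{SS}(\gamma|\mu, \alpha, \beta)$ based on a
generalized inverse. By (12.21), $\mathrm{SSE} = \mathbf{y}'[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}']\mathbf{y}$. For our illustrative model with
$a = 3$, $b = 2$, and $n = 2$, the matrix $\mathbf{X}'\mathbf{X}$ is given in (14.5) and a generalized inverse
$(\mathbf{X}'\mathbf{X})^{-}$ is provided in (14.23). The $12 \times 12$ matrix $\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'$ is then given by

$$\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}' = \frac12\begin{pmatrix}
\mathbf{J} & \mathbf{O} & \cdots & \mathbf{O} \\
\mathbf{O} & \mathbf{J} & \cdots & \mathbf{O} \\
\vdots & \vdots & & \vdots \\
\mathbf{O} & \mathbf{O} & \cdots & \mathbf{J}
\end{pmatrix} = \frac12\begin{pmatrix}
\mathbf{j}\mathbf{j}' & \mathbf{O} & \cdots & \mathbf{O} \\
\mathbf{O} & \mathbf{j}\mathbf{j}' & \cdots & \mathbf{O} \\
\vdots & \vdots & & \vdots \\
\mathbf{O} & \mathbf{O} & \cdots & \mathbf{j}\mathbf{j}'
\end{pmatrix}, \tag{14.50}$$

where $\mathbf{J}$ and $\mathbf{O}$ are $2 \times 2$ and $\mathbf{j}$ is $2 \times 1$ (see Problem 14.17). The vector $\mathbf{y}$ in (14.4)
can be written as

$$\mathbf{y} = \begin{pmatrix} \mathbf{y}_{11} \\ \mathbf{y}_{12} \\ \mathbf{y}_{21} \\ \mathbf{y}_{22} \\ \mathbf{y}_{31} \\ \mathbf{y}_{32} \end{pmatrix}, \tag{14.51}$$

where $\mathbf{y}_{ij} = \begin{pmatrix} y_{ij1} \\ y_{ij2} \end{pmatrix}$, $i = 1, 2, 3$, $j = 1, 2$. By (12.21), (14.50), and (14.51), SSE
becomes

$$\begin{aligned}
\mathrm{SSE} &= \mathbf{y}'[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}']\mathbf{y} = \mathbf{y}'\mathbf{y} - \mathbf{y}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{y} \\
&= \sum_{ijk} y_{ijk}^2 - \tfrac12\sum_{ij} \mathbf{y}_{ij}'\mathbf{j}\mathbf{j}'\mathbf{y}_{ij} = \sum_{ijk} y_{ijk}^2 - \tfrac12\sum_{ij} y_{ij.}^2,
\end{aligned}$$

which is the same as (14.46) with $n = 2$.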


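For $\mathrm{SS}(\gamma|\mu, \alpha, \beta)$, we obtain

$$\begin{aligned}
\mathrm{SS}(\gamma|\mu, \alpha, \beta) &= \mathrm{SS}(\mu, \alpha, \beta, \gamma) - \mathrm{SS}(\mu, \alpha, \beta) \\
&= \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} - \hat{\boldsymbol{\beta}}_1'\mathbf{X}_1'\mathbf{y} \\
&= \mathbf{y}'[\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}' - \mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-}\mathbf{X}_1']\mathbf{y},
\end{aligned} \tag{14.52}$$

where $\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'$ is as found in (14.50) and $\mathbf{X}_1$ consists of the first six columns of $\mathbf{X}$
in (14.4). The matrix $\mathbf{X}_1'\mathbf{X}_1$ is given in (14.42), and a generalized inverse of $\mathbf{X}_1'\mathbf{X}_1$ is
given by

$$(\mathbf{X}_1'\mathbf{X}_1)^{-} = \frac{1}{12}\begin{pmatrix}
-1 & 0 & 0 & 0 & 0 & 0 \\
0 & 3 & 0 & 0 & 0 & 0 \\
0 & 0 & 3 & 0 & 0 & 0 \\
0 & 0 & 0 & 3 & 0 & 0 \\
0 & 0 & 0 & 0 & 2 & 0 \\
0 & 0 & 0 & 0 & 0 & 2
\end{pmatrix}. \tag{14.53}$$

Then

$$\mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-}\mathbf{X}_1' = \frac{1}{12}\begin{pmatrix}
4\mathbf{J} & 2\mathbf{J} & \mathbf{J} & -\mathbf{J} & \mathbf{J} & -\mathbf{J} \\
2\mathbf{J} & 4\mathbf{J} & -\mathbf{J} & \mathbf{J} & -\mathbf{J} & \mathbf{J} \\
\mathbf{J} & -\mathbf{J} & 4\mathbf{J} & 2\mathbf{J} & \mathbf{J} & -\mathbf{J} \\
-\mathbf{J} & \mathbf{J} & 2\mathbf{J} & 4\mathbf{J} & -\mathbf{J} & \mathbf{J} \\
\mathbf{J} & -\mathbf{J} & \mathbf{J} & -\mathbf{J} & 4\mathbf{J} & 2\mathbf{J} \\
-\mathbf{J} & \mathbf{J} & -\mathbf{J} & \mathbf{J} & 2\mathbf{J} & 4\mathbf{J}
\end{pmatrix}, \tag{14.54}$$

where $\mathbf{J}$ is $2 \times 2$. For the difference between (14.50) and (14.54), we obtain

$$\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}' - \mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-}\mathbf{X}_1' = \frac{1}{12}\begin{pmatrix}
2\mathbf{J} & -2\mathbf{J} & -\mathbf{J} & \mathbf{J} & -\mathbf{J} & \mathbf{J} \\
-2\mathbf{J} & 2\mathbf{J} & \mathbf{J} & -\mathbf{J} & \mathbf{J} & -\mathbf{J} \\
-\mathbf{J} & \mathbf{J} & 2\mathbf{J} & -2\mathbf{J} & -\mathbf{J} & \mathbf{J} \\
\mathbf{J} & -\mathbf{J} & -2\mathbf{J} & 2\mathbf{J} & \mathbf{J} & -\mathbf{J} \\
-\mathbf{J} & \mathbf{J} & -\mathbf{J} & \mathbf{J} & 2\mathbf{J} & -2\mathbf{J} \\
\mathbf{J} & -\mathbf{J} & \mathbf{J} & -\mathbf{J} & -2\mathbf{J} & 2\mathbf{J}
\end{pmatrix}, \tag{14.55}$$

where $\mathbf{J}$ is $2 \times 2$.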


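To show that $\mathrm{SS}(\gamma|\mu, \alpha, \beta) = \mathbf{y}'[\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}' - \mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-}\mathbf{X}_1']\mathbf{y}$ in (14.52) is
equal to the formulation of $\mathrm{SS}(\gamma|\mu, \alpha, \beta)$ shown in (14.45), we first write (14.45)
in matrix notation:

$$\frac12\sum_{i=1}^{3}\sum_{j=1}^{2} y_{ij.}^2 - \frac14\sum_{i=1}^{3} y_{i..}^2 - \frac16\sum_{j=1}^{2} y_{.j.}^2 + \frac{1}{12}y_{...}^2 = \mathbf{y}'\left(\tfrac12\mathbf{A} - \tfrac14\mathbf{B} - \tfrac16\mathbf{C} + \tfrac{1}{12}\mathbf{D}\right)\mathbf{y}. \tag{14.56}$$

We now find $\mathbf{A}$, $\mathbf{B}$, $\mathbf{C}$, and $\mathbf{D}$. For $\tfrac12\sum_{ij} y_{ij.}^2 = \tfrac12\mathbf{y}'\mathbf{A}\mathbf{y}$, we have by (14.50) and
(14.51),

$$\tfrac12\sum_{ij} y_{ij.}^2 = \tfrac12\sum_{i=1}^{3}\sum_{j=1}^{2} \mathbf{y}_{ij}'\mathbf{j}\mathbf{j}'\mathbf{y}_{ij},$$

where $\mathbf{j}$ is $2 \times 1$. This can be written as

$$\tfrac12\sum_{ij} y_{ij.}^2 = \tfrac12(\mathbf{y}_{11}', \mathbf{y}_{12}', \ldots, \mathbf{y}_{32}')\begin{pmatrix}
\mathbf{j}\mathbf{j}' & \mathbf{O} & \cdots & \mathbf{O} \\
\mathbf{O} & \mathbf{j}\mathbf{j}' & \cdots & \mathbf{O} \\
\vdots & \vdots & & \vdots \\
\mathbf{O} & \mathbf{O} & \cdots & \mathbf{j}\mathbf{j}'
\end{pmatrix}\begin{pmatrix} \mathbf{y}_{11} \\ \mathbf{y}_{12} \\ \vdots \\ \mathbf{y}_{32} \end{pmatrix} = \tfrac12\mathbf{y}'\mathbf{A}\mathbf{y}, \tag{14.57}$$

where

$$\mathbf{A} = \begin{pmatrix}
\mathbf{J} & \mathbf{O} & \cdots & \mathbf{O} \\
\mathbf{O} & \mathbf{J} & \cdots & \mathbf{O} \\
\vdots & \vdots & & \vdots \\
\mathbf{O} & \mathbf{O} & \cdots & \mathbf{J}
\end{pmatrix}$$

and $\mathbf{J}$ is $2 \times 2$. Note that by (14.50), we also have $\tfrac12\mathbf{A} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'$.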

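For the second term in (14.56), $\tfrac14\sum_i y_{i..}^2$, we first use (14.51) to write $y_{i..}$ and
$y_{i..}^2$ as

$$y_{i..} = \sum_{jk} y_{ijk} = \sum_k y_{i1k} + \sum_k y_{i2k} = \mathbf{y}_{i1}'\mathbf{j} + \mathbf{y}_{i2}'\mathbf{j} = (\mathbf{y}_{i1}', \mathbf{y}_{i2}')\begin{pmatrix} \mathbf{j} \\ \mathbf{j} \end{pmatrix},$$

$$y_{i..}^2 = (\mathbf{y}_{i1}', \mathbf{y}_{i2}')\begin{pmatrix} \mathbf{j} \\ \mathbf{j} \end{pmatrix}(\mathbf{j}', \mathbf{j}')\begin{pmatrix} \mathbf{y}_{i1} \\ \mathbf{y}_{i2} \end{pmatrix} = (\mathbf{y}_{i1}', \mathbf{y}_{i2}')\begin{pmatrix} \mathbf{j}\mathbf{j}' & \mathbf{j}\mathbf{j}' \\ \mathbf{j}\mathbf{j}' & \mathbf{j}\mathbf{j}' \end{pmatrix}\begin{pmatrix} \mathbf{y}_{i1} \\ \mathbf{y}_{i2} \end{pmatrix}.$$

Thus $\tfrac14\sum_{i=1}^{3} y_{i..}^2$ can be written as

$$\tfrac14\sum_{i=1}^{3} y_{i..}^2 = \tfrac14(\mathbf{y}_{11}', \mathbf{y}_{12}', \ldots, \mathbf{y}_{32}')\begin{pmatrix}
\mathbf{J} & \mathbf{J} & \mathbf{O} & \mathbf{O} & \mathbf{O} & \mathbf{O} \\
\mathbf{J} & \mathbf{J} & \mathbf{O} & \mathbf{O} & \mathbf{O} & \mathbf{O} \\
\mathbf{O} & \mathbf{O} & \mathbf{J} & \mathbf{J} & \mathbf{O} & \mathbf{O} \\
\mathbf{O} & \mathbf{O} & \mathbf{J} & \mathbf{J} & \mathbf{O} & \mathbf{O} \\
\mathbf{O} & \mathbf{O} & \mathbf{O} & \mathbf{O} & \mathbf{J} & \mathbf{J} \\
\mathbf{O} & \mathbf{O} & \mathbf{O} & \mathbf{O} & \mathbf{J} & \mathbf{J}
\end{pmatrix}\begin{pmatrix} \mathbf{y}_{11} \\ \mathbf{y}_{12} \\ \vdots \\ \mathbf{y}_{32} \end{pmatrix} = \tfrac14\mathbf{y}'\mathbf{B}\mathbf{y}. \tag{14.58}$$

Similarly, the third term of (14.56), $\tfrac16\sum_{j=1}^{2} y_{.j.}^2$, can be written as

$$\tfrac16\sum_{j=1}^{2} y_{.j.}^2 = \tfrac16\mathbf{y}'\begin{pmatrix}
\mathbf{J} & \mathbf{O} & \mathbf{J} & \mathbf{O} & \mathbf{J} & \mathbf{O} \\
\mathbf{O} & \mathbf{J} & \mathbf{O} & \mathbf{J} & \mathbf{O} & \mathbf{J} \\
\mathbf{J} & \mathbf{O} & \mathbf{J} & \mathbf{O} & \mathbf{J} & \mathbf{O} \\
\mathbf{O} & \mathbf{J} & \mathbf{O} & \mathbf{J} & \mathbf{O} & \mathbf{J} \\
\mathbf{J} & \mathbf{O} & \mathbf{J} & \mathbf{O} & \mathbf{J} & \mathbf{O} \\
\mathbf{O} & \mathbf{J} & \mathbf{O} & \mathbf{J} & \mathbf{O} & \mathbf{J}
\end{pmatrix}\mathbf{y} = \tfrac16\mathbf{y}'\mathbf{C}\mathbf{y}. \tag{14.59}$$

For the fourth term of (14.56), $y_{...}^2/12$, we have

$$y_{...} = \sum_{ijk} y_{ijk} = \mathbf{y}'\mathbf{j}_{12},$$

$$\tfrac{1}{12}y_{...}^2 = \tfrac{1}{12}\mathbf{y}'\mathbf{j}_{12}\mathbf{j}_{12}'\mathbf{y} = \tfrac{1}{12}\mathbf{y}'\mathbf{J}_{12}\mathbf{y} = \tfrac{1}{12}\mathbf{y}'\mathbf{D}\mathbf{y}, \tag{14.60}$$

where $\mathbf{j}_{12}$ is $12 \times 1$ and $\mathbf{J}_{12}$ is $12 \times 12$. To conform with $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$ in (14.57),
(14.58), and (14.59), we write $\mathbf{D} = \mathbf{J}_{12}$ as

$$\mathbf{D} = \begin{pmatrix}
\mathbf{J} & \mathbf{J} & \mathbf{J} & \mathbf{J} & \mathbf{J} & \mathbf{J} \\
\mathbf{J} & \mathbf{J} & \mathbf{J} & \mathbf{J} & \mathbf{J} & \mathbf{J} \\
\mathbf{J} & \mathbf{J} & \mathbf{J} & \mathbf{J} & \mathbf{J} & \mathbf{J} \\
\mathbf{J} & \mathbf{J} & \mathbf{J} & \mathbf{J} & \mathbf{J} & \mathbf{J} \\
\mathbf{J} & \mathbf{J} & \mathbf{J} & \mathbf{J} & \mathbf{J} & \mathbf{J} \\
\mathbf{J} & \mathbf{J} & \mathbf{J} & \mathbf{J} & \mathbf{J} & \mathbf{J}
\end{pmatrix},$$

where $\mathbf{J}$ is $2 \times 2$.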

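Now, combining (14.57)-(14.60), we obtain the matrix of the quadratic form in
(14.56):

$$\tfrac12\mathbf{A} - \tfrac14\mathbf{B} - \tfrac16\mathbf{C} + \tfrac{1}{12}\mathbf{D} = \frac{1}{12}\begin{pmatrix}
2\mathbf{J} & -2\mathbf{J} & -\mathbf{J} & \mathbf{J} & -\mathbf{J} & \mathbf{J} \\
-2\mathbf{J} & 2\mathbf{J} & \mathbf{J} & -\mathbf{J} & \mathbf{J} & -\mathbf{J} \\
-\mathbf{J} & \mathbf{J} & 2\mathbf{J} & -2\mathbf{J} & -\mathbf{J} & \mathbf{J} \\
\mathbf{J} & -\mathbf{J} & -2\mathbf{J} & 2\mathbf{J} & \mathbf{J} & -\mathbf{J} \\
-\mathbf{J} & \mathbf{J} & -\mathbf{J} & \mathbf{J} & 2\mathbf{J} & -2\mathbf{J} \\
\mathbf{J} & -\mathbf{J} & \mathbf{J} & -\mathbf{J} & -2\mathbf{J} & 2\mathbf{J}
\end{pmatrix}, \tag{14.61}$$

which is the same as (14.55). Thus the matrix version of $\mathrm{SS}(\gamma|\mu, \alpha, \beta)$ in (14.52) is
equal to $\mathrm{SS}(\gamma|\mu, \alpha, \beta)$ in (14.45):

$$\mathbf{y}'[\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}' - \mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-}\mathbf{X}_1']\mathbf{y} = \sum_{ij} \frac{y_{ij.}^2}{n} - \sum_i \frac{y_{i..}^2}{bn} - \sum_j \frac{y_{.j.}^2}{an} + \frac{y_{...}^2}{abn}.$$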

14.4.2. Tests for Main Effects 


In Section 14.4.2.1, we develop a test for main effects using the full-reduced-model
approach. In Section 14.4.2.2, a test for main effects is obtained using the general
linear hypothesis approach. Throughout much of this section, we use $a = 3$ and
$b = 2$, where $a$ is the number of levels of factor A and $b$ is the number of levels of
factor B.


14.4.2.1 Full-Reduced-Model Approach
If interaction is present in the two-way model, then by (14.9) and (14.10), we cannot
test $H_0\colon \alpha_1 = \alpha_2 = \alpha_3$ (for $a = 3$) because $\alpha_1 - \alpha_2$ and $\alpha_1 - \alpha_3$ are not estimable. In
fact, there are no estimable contrasts in the $\alpha$'s alone or the $\beta$'s alone (see Problem
14.2). Thus, if there is interaction, the effect of factor A is different for each level
of factor B and vice versa.

To examine the main effect of factor A, we consider $\alpha_i^* = \bar\mu_{i.} - \bar\mu_{..}$, as defined in
(14.14). This can be written as

$$\begin{aligned}
\alpha_i^* &= \frac{1}{b}\sum_{j=1}^{b} \mu_{ij} - \frac{1}{b}\sum_{j=1}^{b} \bar\mu_{.j} \\
&= \frac{1}{b}\Big(\sum_j \mu_{ij} - \sum_j \bar\mu_{.j}\Big) \\
&= \frac{1}{b}\sum_j (\mu_{ij} - \bar\mu_{.j}).
\end{aligned} \tag{14.62}$$

The expression in parentheses, $\mu_{ij} - \bar\mu_{.j}$, is the effect of the $i$th level of factor A at the
$j$th level of factor B. Thus in (14.62), $\alpha_i^* = \bar\mu_{i.} - \bar\mu_{..}$ is expressed as the average effect
of the $i$th level of factor A (averaged over the levels of B). This definition leads to the
side condition $\sum_i \alpha_i^* = 0$.

Since the $\alpha_i^*$'s are estimable [see (14.21) and the comment following], we can use
them to express the hypothesis for factor A. For $a = 3$, this becomes

$$H_0\colon \alpha_1^* = \alpha_2^* = \alpha_3^*, \tag{14.63}$$

which is equivalent to

$$H_0\colon \alpha_1^* = \alpha_2^* = \alpha_3^* = 0 \tag{14.64}$$

because $\sum_i \alpha_i^* = 0$.

The hypothesis $H_0\colon \alpha_1^* = \alpha_2^* = \alpha_3^*$ in (14.63) states that there is no effect of factor
A when averaged over the levels of B. Using $\alpha_i^* = \bar\mu_{i.} - \bar\mu_{..}$, we can express
$H_0\colon \alpha_1^* = \alpha_2^* = \alpha_3^*$ in terms of means:

$$H_0\colon \bar\mu_{1.} - \bar\mu_{..} = \bar\mu_{2.} - \bar\mu_{..} = \bar\mu_{3.} - \bar\mu_{..},$$

which can be written as

$$H_0\colon \bar\mu_{1.} = \bar\mu_{2.} = \bar\mu_{3.}.$$

The values for the cell means in Figure 14.2 illustrate a situation in which $H_0$ holds in
the presence of interaction.

Because Ho in (14.63) or (14.64) is based on an average effect, many texts 
recommend that the interaction AB be tested first, and if it is found to be significant, 
then the main effects should not be tested. However, with the main effect of A defined 
as the average effect over the levels of B and similarly for the effect of B, the tests for 
A and B can be carried out even if AB is significant. Admittedly, interpretation 
requires more care, and the effect of a factor may change if the number of levels 
of the other factor is altered. But in many cases useful information can be gained 
about the main effects in the presence of interaction. 


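[Figure 14.2: table of cell means with the levels of A as rows, the levels of B (1, 2) as columns, and a column of row means.]

Figure 14.2 Cell means illustrating $\bar\mu_{1.} = \bar\mu_{2.} = \bar\mu_{3.}$ in the presence of interaction.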
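To make the situation concrete, the sketch below uses a hypothetical set of cell means for $a = 3$ and $b = 2$. These values are our own illustration and are not the values shown in Figure 14.2. The three row means are equal, so $H_0\colon \alpha_1^* = \alpha_2^* = \alpha_3^* = 0$ holds, yet the B effect $\mu_{i1} - \mu_{i2}$ changes from one level of A to the next, so interaction is present.

```python
import numpy as np

# Hypothetical cell means mu[i, j] (illustration only, not from Figure 14.2)
mu = np.array([[1.0, 5.0],
               [3.0, 3.0],
               [5.0, 1.0]])

row_means = mu.mean(axis=1)             # mubar_i.
alpha_star = row_means - mu.mean()      # (14.14): alpha_i^* = mubar_i. - mubar_..

# B effect at each level of A; unequal values indicate interaction, as in (14.30)
b_effects = mu[:, 0] - mu[:, 1]

# Interaction parameters gamma_ij^* from (14.14)
gamma_star = (mu - mu.mean(axis=1, keepdims=True)
              - mu.mean(axis=0, keepdims=True) + mu.mean())

print(row_means, alpha_star, b_effects)
```

Here every $\alpha_i^*$ is zero while the $\gamma_{ij}^*$ are not, which is exactly the configuration in which a main-effect test on A finds nothing even though A matters at each individual level of B.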


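Under $H_0\colon \alpha_1^* = \alpha_2^* = \alpha_3^* = 0$, the full model in (14.33) reduces to

$$y_{ijk} = \mu^* + \beta_j^* + \gamma_{ij}^* + \varepsilon_{ijk}. \tag{14.65}$$

Because of the orthogonality of the balanced model, the estimators of $\mu^*$, $\beta_j^*$, and $\gamma_{ij}^*$
in (14.65) are the same as in the full model. If we use $\hat\mu$, $\hat\beta_j$, and $\hat\gamma_{ij}$ in (14.21) and
elements of $\mathbf{X}'\mathbf{y}$ in (14.19) extended to general $a$, $b$, and $n$, we obtain

$$\mathrm{SS}(\mu, \beta, \gamma) = \hat\mu y_{...} + \sum_{j=1}^{b} \hat\beta_j y_{.j.} + \sum_{i=1}^{a}\sum_{j=1}^{b} \hat\gamma_{ij} y_{ij.},$$

which, by (14.40), becomes

$$\mathrm{SS}(\mu, \beta, \gamma) = \frac{y_{...}^2}{abn} + \left(\sum_j \frac{y_{.j.}^2}{an} - \frac{y_{...}^2}{abn}\right) + \left(\sum_{ij} \frac{y_{ij.}^2}{n} - \sum_i \frac{y_{i..}^2}{bn} - \sum_j \frac{y_{.j.}^2}{an} + \frac{y_{...}^2}{abn}\right). \tag{14.66}$$

From (14.40) and (14.66), we have

$$\begin{aligned}
\mathrm{SS}(\alpha|\mu, \beta, \gamma) &= \mathrm{SS}(\mu, \alpha, \beta, \gamma) - \mathrm{SS}(\mu, \beta, \gamma) \\
&= \sum_{i=1}^{a} \frac{y_{i..}^2}{bn} - \frac{y_{...}^2}{abn}.
\end{aligned} \tag{14.67}$$

For the special case of $a = 3$, we see by (14.7) that there are two linearly independent
estimable functions involving the three $\alpha$'s. Therefore, $\mathrm{SS}(\alpha|\mu, \beta, \gamma)$ has 2 degrees of
freedom. In general, $\mathrm{SS}(\alpha|\mu, \beta, \gamma)$ has $a - 1$ degrees of freedom.

In an analogous manner, for factor B we obtain

$$\begin{aligned}
\mathrm{SS}(\beta|\mu, \alpha, \gamma) &= \mathrm{SS}(\mu, \alpha, \beta, \gamma) - \mathrm{SS}(\mu, \alpha, \gamma) \\
&= \sum_{j=1}^{b} \frac{y_{.j.}^2}{an} - \frac{y_{...}^2}{abn},
\end{aligned} \tag{14.68}$$

which has $b - 1$ degrees of freedom.

In terms of means, we can express (14.67) and (14.68) as

$$\mathrm{SS}(\alpha|\mu, \beta, \gamma) = bn\sum_{i=1}^{a} (\bar y_{i..} - \bar y_{...})^2, \tag{14.69}$$

$$\mathrm{SS}(\beta|\mu, \alpha, \gamma) = an\sum_{j=1}^{b} (\bar y_{.j.} - \bar y_{...})^2. \tag{14.70}$$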


It is important to note that the full-reduced-model approach leading to
$\mathrm{SS}(\alpha|\mu, \beta, \gamma)$ in (14.67) cannot be expressed in terms of matrices in a manner
analogous to that in (14.52) for the interaction, namely, $\mathrm{SS}(\gamma|\mu, \alpha, \beta) = \mathbf{y}'[\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}' - \mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-}\mathbf{X}_1']\mathbf{y}$.
The matrix approach is appropriate for the interaction
because there are estimable functions of the $\gamma_{ij}$'s that do not involve $\mu$ or
the $\alpha_i$ or $\beta_j$ terms. In the case of the A main effect, however, we cannot obtain a
matrix $\mathbf{X}_1$ by deleting the three columns of $\mathbf{X}$ corresponding to $\alpha_1$, $\alpha_2$, and $\alpha_3$
because contrasts of the form $\alpha_1 - \alpha_2$ are not estimable without involving the $\gamma_{ij}$'s
[see (14.9) and (14.10)].

If we add the sums of squares for factor A, B, and the interaction in (14.67),
(14.68), and (14.45), we obtain $\sum_{ij} y_{ij.}^2/n - y_{...}^2/abn$, which is the overall sum of
squares for "treatments," $\mathrm{SS}(\alpha, \beta, \gamma|\mu)$. This can also be seen in (14.40). In the
following theorem, the three sums of squares are shown to be independent.


Theorem 14.4b. If $\mathbf{y}$ is $N_{abn}(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{I})$, then $\mathrm{SS}(\alpha|\mu, \beta, \gamma)$, $\mathrm{SS}(\beta|\mu, \alpha, \gamma)$, and
$\mathrm{SS}(\gamma|\mu, \alpha, \beta)$ are independent.


PROOF. This follows from Theorem 5.6c; see Problem 14.23.


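Using (14.45), (14.46), (14.67), and (14.68), we obtain the analysis-of-variance
(ANOVA) table given in Table 14.1.


TABLE 14.1 ANOVA Table for a Two-Way Model with Interaction

$$\begin{array}{lll}
\text{Source of Variation} & \text{df} & \text{Sum of Squares} \\ \hline
\text{Factor A} & a - 1 & \sum_i \dfrac{y_{i..}^2}{bn} - \dfrac{y_{...}^2}{abn} \\
\text{Factor B} & b - 1 & \sum_j \dfrac{y_{.j.}^2}{an} - \dfrac{y_{...}^2}{abn} \\
\text{Interaction} & (a - 1)(b - 1) & \sum_{ij} \dfrac{y_{ij.}^2}{n} - \sum_i \dfrac{y_{i..}^2}{bn} - \sum_j \dfrac{y_{.j.}^2}{an} + \dfrac{y_{...}^2}{abn} \\
\text{Error} & ab(n - 1) & \sum_{ijk} y_{ijk}^2 - \sum_{ij} \dfrac{y_{ij.}^2}{n} \\
\text{Total} & abn - 1 & \sum_{ijk} y_{ijk}^2 - \dfrac{y_{...}^2}{abn}
\end{array}$$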


The test statistic for factor A is

$$F = \frac{\mathrm{SS}(\alpha|\mu, \beta, \gamma)/(a - 1)}{\mathrm{SSE}/[ab(n - 1)]}, \tag{14.71}$$

which is distributed as $F[a - 1, ab(n - 1)]$ if $H_0\colon \alpha_1^* = \alpha_2^* = \cdots = \alpha_a^* = 0$ is true.
For factor B, we use $\mathrm{SS}(\beta|\mu, \alpha, \gamma)$ in (14.68), and the $F$ statistic is given by

$$F = \frac{\mathrm{SS}(\beta|\mu, \alpha, \gamma)/(b - 1)}{\mathrm{SSE}/[ab(n - 1)]},$$

which is distributed as $F[b - 1, ab(n - 1)]$ if $H_0\colon \beta_1^* = \beta_2^* = \cdots = \beta_b^* = 0$ is true. In
Section 14.4.2.2, these $F$ statistics are obtained by the general linear hypothesis
approach. The $F$ distributions can thereby be justified by Theorem 12.7c.


Example 14.4. The moisture content of three types of cheese made by two methods
was recorded by Marcuse (1949) (format altered). Two cheeses were measured for
each type and each method. If method is designated as factor A and type is factor
B, then $a = 2$, $b = 3$, and $n = 2$. The data are given in Table 14.2, and the totals
are shown in Table 14.3.

The sum of squares for factor A is given by (14.67) as

$$\begin{aligned}
\mathrm{SS}(\alpha|\mu, \beta, \gamma) &= \sum_{i=1}^{2} \frac{y_{i..}^2}{(3)(2)} - \frac{y_{...}^2}{(2)(3)(2)} \\
&= \tfrac16[(221.98)^2 + (220.81)^2] - \tfrac{1}{12}(442.79)^2 \\
&= .114075.
\end{aligned}$$


TABLE 14.2 Moisture Content of Two Cheeses from
Each of Three Different Types Made by Two Methods

                  Type of Cheese
Method        1         2         3
  1         39.02     35.74     37.02
            38.79     35.41     36.00
  2         38.96     35.58     35.70
            39.01     35.52     36.04

TABLE 14.3 Totals for Data in Table 14.2

                                   B
A             1                   2                   3                 Totals
1        y_{11.} = 77.81     y_{12.} = 71.15     y_{13.} = 73.02     y_{1..} = 221.98
2        y_{21.} = 77.97     y_{22.} = 71.10     y_{23.} = 71.74     y_{2..} = 220.81
Totals   y_{.1.} = 155.78    y_{.2.} = 142.25    y_{.3.} = 144.76    y_{...} = 442.79

Similarly, for factor B we use (14.68):

$$\begin{aligned}
\mathrm{SS}(\beta|\mu, \alpha, \gamma) &= \sum_{j=1}^{3} \frac{y_{.j.}^2}{(2)(2)} - \frac{y_{...}^2}{(2)(3)(2)} \\
&= \tfrac14[(155.78)^2 + (142.25)^2 + (144.76)^2] - \tfrac{1}{12}(442.79)^2 \\
&= 25.900117.
\end{aligned}$$

For error, we use (14.46) to obtain

$$\begin{aligned}
\mathrm{SSE} &= \sum_{ijk} y_{ijk}^2 - \tfrac12\sum_{ij} y_{ij.}^2 \\
&= (39.02)^2 + (38.79)^2 + \cdots + (36.04)^2 - \tfrac12[(77.81)^2 + \cdots + (71.74)^2] \\
&= 16{,}365.56070 - 16{,}364.89875 = .661950.
\end{aligned}$$

The total sum of squares is given by

$$\mathrm{SST} = \sum_{ijk} y_{ijk}^2 - \frac{y_{...}^2}{12} = 26.978692.$$

The sum of squares for interaction can be found by (14.45) or by subtracting all other
terms from the total sum of squares:

$$\mathrm{SS}(\gamma|\mu, \alpha, \beta) = 26.978692 - .114075 - 25.900117 - .661950 = .302550.$$


With these sums of squares, we can compute mean squares and $F$ statistics as
shown in Table 14.4.

Only the $F$ test for type is significant, since $F_{.05,1,6} = 5.99$ and $F_{.05,2,6} = 5.14$. The
$p$ value for type is .0000155. The $p$ values for method and the interaction are .3485
and .3233, respectively.
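The arithmetic of this example can be reproduced with a short script. The sketch below uses only the Python standard library and the totals formulas (14.45), (14.46), (14.67), and (14.68). The variable names are ours, and the second replicate for method 2 is entered as the values implied by the cell totals in Table 14.3 (39.01, 35.52, 36.04), which is an assumption of this sketch.

```python
# Moisture data indexed as data[method][type] -> [replicate 1, replicate 2]
data = [
    [[39.02, 38.79], [35.74, 35.41], [37.02, 36.00]],  # method 1
    [[38.96, 39.01], [35.58, 35.52], [35.70, 36.04]],  # method 2 (2nd replicate implied by totals)
]
a, b, n = 2, 3, 2

cell = [[sum(data[i][j]) for j in range(b)] for i in range(a)]   # y_ij.
row = [sum(cell[i]) for i in range(a)]                           # y_i..
col = [sum(cell[i][j] for i in range(a)) for j in range(b)]      # y_.j.
grand = sum(row)                                                 # y_...

cf = grand ** 2 / (a * b * n)                                    # y_...^2 / abn
ss_cells = sum(cell[i][j] ** 2 for i in range(a) for j in range(b)) / n
ss_raw = sum(v ** 2 for i in range(a) for j in range(b) for v in data[i][j])

ss_a = sum(r ** 2 for r in row) / (b * n) - cf                   # (14.67), method
ss_b = sum(c ** 2 for c in col) / (a * n) - cf                   # (14.68), type
ss_ab = ss_cells - cf - ss_a - ss_b                              # (14.45), interaction
sse = ss_raw - ss_cells                                          # (14.46), error
ss_total = ss_raw - cf

mse = sse / (a * b * (n - 1))
f_a = (ss_a / (a - 1)) / mse
f_b = (ss_b / (b - 1)) / mse
f_ab = (ss_ab / ((a - 1) * (b - 1))) / mse

print(f"method: SS={ss_a:.6f} F={f_a:.3f}")
print(f"type: SS={ss_b:.6f} F={f_b:.3f}")
print(f"interaction: SS={ss_ab:.6f} F={f_ab:.3f}")
print(f"error: SS={sse:.6f} MS={mse:.6f}")
```

The $F$ ratios would then be compared with the critical values quoted in the text; the script does not compute $p$ values.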




TABLE 14.4 ANOVA for the Cheese Data in Table 14.2 


Source of Sum of Mean 
Variation Squares df Square F 
Method 0.114075 1 0.114075 1.034 
Type 25.900117 2 12.950058 117.381 
Interaction 0.302550 2 0.151275 1.371 
Error 0.661950 6 0.110325 

Total 26.978692 11 


Note that in Table 14.2, the difference between the two replicates in each cell is
very small except for the cell with method 1 and type 3. This suggests that the
replicates may be repeat measurements rather than true replications; that is, the
experimenter may have measured the same piece of cheese twice rather than measuring
two different cheeses.


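14.4.2.2 General Linear Hypothesis Approach

We now obtain $\mathrm{SS}(\alpha|\mu, \beta, \gamma)$ for $a = 3$ and $b = 2$ by an approach based on the
general linear hypothesis. Using $\alpha_i^* = \alpha_i - \bar\alpha_{.} + \bar\gamma_{i.} - \bar\gamma_{..}$ in (14.16), the hypothesis
$H_0\colon \alpha_1^* = \alpha_2^* = \alpha_3^*$ in (14.63) can be expressed as $H_0\colon \alpha_1 + \bar\gamma_{1.} = \alpha_2 + \bar\gamma_{2.} = \alpha_3 + \bar\gamma_{3.}$ or

$$H_0\colon \alpha_1 + \tfrac12(\gamma_{11} + \gamma_{12}) = \alpha_2 + \tfrac12(\gamma_{21} + \gamma_{22}) = \alpha_3 + \tfrac12(\gamma_{31} + \gamma_{32}) \tag{14.72}$$

[see also (14.9) and (14.10)]. The two equalities in (14.72) can be expressed in
the form

$$H_0\colon \begin{pmatrix}
\alpha_1 + \tfrac12\gamma_{11} + \tfrac12\gamma_{12} - \alpha_3 - \tfrac12\gamma_{31} - \tfrac12\gamma_{32} \\
\alpha_2 + \tfrac12\gamma_{21} + \tfrac12\gamma_{22} - \alpha_3 - \tfrac12\gamma_{31} - \tfrac12\gamma_{32}
\end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}.$$

Rearranging the order of the parameters to correspond to the order in
$\boldsymbol{\beta} = (\mu, \alpha_1, \alpha_2, \alpha_3, \beta_1, \beta_2, \gamma_{11}, \gamma_{12}, \gamma_{21}, \gamma_{22}, \gamma_{31}, \gamma_{32})'$ in (14.4), we have

$$H_0\colon \begin{pmatrix}
\alpha_1 - \alpha_3 + \tfrac12\gamma_{11} + \tfrac12\gamma_{12} - \tfrac12\gamma_{31} - \tfrac12\gamma_{32} \\
\alpha_2 - \alpha_3 + \tfrac12\gamma_{21} + \tfrac12\gamma_{22} - \tfrac12\gamma_{31} - \tfrac12\gamma_{32}
\end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \tag{14.73}$$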


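which can now be written in the form $H_0\colon \mathbf{C}\boldsymbol{\beta} = \mathbf{0}$ with

$$\mathbf{C} = \begin{pmatrix}
0 & 1 & 0 & -1 & 0 & 0 & \tfrac12 & \tfrac12 & 0 & 0 & -\tfrac12 & -\tfrac12 \\
0 & 0 & 1 & -1 & 0 & 0 & 0 & 0 & \tfrac12 & \tfrac12 & -\tfrac12 & -\tfrac12
\end{pmatrix}. \tag{14.74}$$

By Theorem 12.7b(iii), the sum of squares corresponding to $H_0\colon \mathbf{C}\boldsymbol{\beta} = \mathbf{0}$ is

$$\mathrm{SSH} = (\mathbf{C}\hat{\boldsymbol{\beta}})'[\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{C}']^{-1}\mathbf{C}\hat{\boldsymbol{\beta}}. \tag{14.75}$$

Substituting $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{y}$ from (12.13), SSH in (14.75) becomes

$$\mathrm{SSH} = \mathbf{y}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{C}'[\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{C}']^{-1}\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{y} = \mathbf{y}'\mathbf{A}\mathbf{y}. \tag{14.76}$$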


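Using $\mathbf{C}$ in (14.74), $(\mathbf{X}'\mathbf{X})^{-}$ in (14.23), and $\mathbf{X}$ in (14.4), we obtain

$$\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}' = \frac14\begin{pmatrix}
1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & -1 & -1 & -1 & -1 \\
0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 & -1 & -1 & -1 & -1
\end{pmatrix}, \tag{14.77}$$

$$\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{C}' = \frac14\begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}, \qquad [\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{C}']^{-1} = \frac43\begin{pmatrix} 2 & -1 \\ -1 & 2 \end{pmatrix}. \tag{14.78}$$

Then $\mathbf{A} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{C}'[\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{C}']^{-1}\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'$ in (14.76) becomes

$$\mathbf{A} = \frac{1}{12}\begin{pmatrix} 2\mathbf{J} & -\mathbf{J} & -\mathbf{J} \\ -\mathbf{J} & 2\mathbf{J} & -\mathbf{J} \\ -\mathbf{J} & -\mathbf{J} & 2\mathbf{J} \end{pmatrix}, \tag{14.79}$$

where $\mathbf{J}$ is $4 \times 4$. This can be expressed as

$$\mathbf{A} = \frac{1}{12}\begin{pmatrix} 2\mathbf{J} & -\mathbf{J} & -\mathbf{J} \\ -\mathbf{J} & 2\mathbf{J} & -\mathbf{J} \\ -\mathbf{J} & -\mathbf{J} & 2\mathbf{J} \end{pmatrix} = \frac{1}{12}\begin{pmatrix} 3\mathbf{J} & \mathbf{O} & \mathbf{O} \\ \mathbf{O} & 3\mathbf{J} & \mathbf{O} \\ \mathbf{O} & \mathbf{O} & 3\mathbf{J} \end{pmatrix} - \frac{1}{12}\begin{pmatrix} \mathbf{J} & \mathbf{J} & \mathbf{J} \\ \mathbf{J} & \mathbf{J} & \mathbf{J} \\ \mathbf{J} & \mathbf{J} & \mathbf{J} \end{pmatrix}. \tag{14.80}$$

To evaluate $\mathbf{y}'\mathbf{A}\mathbf{y}$, we redefine $\mathbf{y}$ in (14.51) as

$$\mathbf{y} = \begin{pmatrix} \mathbf{y}_{11} \\ \mathbf{y}_{12} \\ \mathbf{y}_{21} \\ \mathbf{y}_{22} \\ \mathbf{y}_{31} \\ \mathbf{y}_{32} \end{pmatrix} = \begin{pmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \\ \mathbf{y}_3 \end{pmatrix}, \quad \text{where } \begin{pmatrix} \mathbf{y}_{i1} \\ \mathbf{y}_{i2} \end{pmatrix} = \mathbf{y}_i. \tag{14.81}$$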

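Then (14.76) becomes

$$\begin{aligned}
\mathrm{SSH} = \mathbf{y}'\mathbf{A}\mathbf{y} &= \tfrac{1}{12}(\mathbf{y}_1', \mathbf{y}_2', \mathbf{y}_3')\begin{pmatrix} 3\mathbf{J}_4 & \mathbf{O} & \mathbf{O} \\ \mathbf{O} & 3\mathbf{J}_4 & \mathbf{O} \\ \mathbf{O} & \mathbf{O} & 3\mathbf{J}_4 \end{pmatrix}\begin{pmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \\ \mathbf{y}_3 \end{pmatrix} - \tfrac{1}{12}\mathbf{y}'\mathbf{J}_{12}\mathbf{y} \\
&= \tfrac14\sum_{i=1}^{3} \mathbf{y}_i'\mathbf{J}_4\mathbf{y}_i - \tfrac{1}{12}\mathbf{y}'\mathbf{J}_{12}\mathbf{y} \\
&= \tfrac14\sum_{i=1}^{3} \mathbf{y}_i'\mathbf{j}_4\mathbf{j}_4'\mathbf{y}_i - \tfrac{1}{12}\mathbf{y}'\mathbf{j}_{12}\mathbf{j}_{12}'\mathbf{y} \\
&= \sum_{i=1}^{3} \frac{y_{i..}^2}{4} - \frac{y_{...}^2}{12},
\end{aligned}$$

which is the same as $\mathrm{SS}(\alpha|\mu, \beta, \gamma)$ in (14.67) with $a = 3$ and $b = n = 2$.

The sum of squares for testing the B main effect can be obtained similarly using a
general linear hypothesis approach (see Problem 14.25).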
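The equality of SSH in (14.75) and $\mathrm{SS}(\alpha|\mu, \beta, \gamma)$ in (14.69) can be confirmed on simulated data. The sketch below assumes the column order of (14.4) and uses `numpy.linalg.pinv` as the generalized inverse; because the rows of $\mathbf{C}$ are estimable functions, $\mathbf{C}\hat{\boldsymbol{\beta}}$ and $\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{C}'$ do not depend on that choice. The helper `design_matrix` is hypothetical.

```python
import numpy as np

def design_matrix(a, b, n):
    """Two-way design matrix with columns (mu, alpha's, beta's, gamma's); hypothetical helper."""
    X = np.zeros((a * b * n, 1 + a + b + a * b))
    row = 0
    for i in range(a):
        for j in range(b):
            for _ in range(n):
                X[row, [0, 1 + i, 1 + a + j, 1 + a + b + i * b + j]] = 1.0
                row += 1
    return X

a, b, n = 3, 2, 2
rng = np.random.default_rng(6)
y = rng.normal(loc=10.0, scale=1.0, size=(a, b, n))   # simulated data, not from the text
X = design_matrix(a, b, n)

# Hypothesis matrix C of (14.74)
C = np.array([
    [0, 1, 0, -1, 0, 0, 0.5, 0.5, 0.0, 0.0, -0.5, -0.5],
    [0, 0, 1, -1, 0, 0, 0.0, 0.0, 0.5, 0.5, -0.5, -0.5],
])

G = np.linalg.pinv(X.T @ X)          # one choice of generalized inverse
beta_hat = G @ X.T @ y.ravel()
CGC = C @ G @ C.T                    # should match (14.78)

Cb = C @ beta_hat
ssh = Cb @ np.linalg.solve(CGC, Cb)  # (14.75)

# (14.69): bn * sum of squared deviations of the row means
row_means = y.mean(axis=(1, 2))
ss_a = b * n * ((row_means - y.mean()) ** 2).sum()
```

Using `np.linalg.solve` avoids forming the explicit inverse of the $2 \times 2$ matrix, although either route gives the same value here.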


14.5 EXPECTED MEAN SQUARES


We find expected mean squares by direct evaluation of the expected value of sums of 
squares and also by a matrix method based on the expected value of quadratic forms. 


14.5.1 Sums-of-Squares Approach 


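The expected mean squares for the tests in Table 14.1 are given in Table 14.5.
Note that these are expressed in terms of $\alpha_i^*$, $\beta_j^*$, and $\gamma_{ij}^*$ subject to the side
conditions $\sum_i \alpha_i^* = 0$, $\sum_j \beta_j^* = 0$, and $\sum_i \gamma_{ij}^* = \sum_j \gamma_{ij}^* = 0$. These expected mean
squares can be derived by inserting the model $y_{ijk} = \mu^* + \alpha_i^* + \beta_j^* + \gamma_{ij}^* + \varepsilon_{ijk}$ in
(14.33) into the sums of squares and then finding expected values. We illustrate
this approach for the first expected mean square in Table 14.5.


TABLE 14.5 Expected Mean Squares for a Two-Way ANOVA

$$\begin{array}{llll}
\text{Source} & \text{Sum of Squares} & \text{Mean Square} & \text{Expected Mean Square} \\ \hline
\text{A} & \mathrm{SS}(\alpha|\mu, \beta, \gamma) & \dfrac{\mathrm{SS}(\alpha|\mu, \beta, \gamma)}{a - 1} & \sigma^2 + bn\sum_i \dfrac{\alpha_i^{*2}}{a - 1} \\
\text{B} & \mathrm{SS}(\beta|\mu, \alpha, \gamma) & \dfrac{\mathrm{SS}(\beta|\mu, \alpha, \gamma)}{b - 1} & \sigma^2 + an\sum_j \dfrac{\beta_j^{*2}}{b - 1} \\
\text{AB} & \mathrm{SS}(\gamma|\mu, \alpha, \beta) & \dfrac{\mathrm{SS}(\gamma|\mu, \alpha, \beta)}{(a - 1)(b - 1)} & \sigma^2 + n\sum_{ij} \dfrac{\gamma_{ij}^{*2}}{(a - 1)(b - 1)} \\
\text{Error} & \mathrm{SSE} & \dfrac{\mathrm{SSE}}{ab(n - 1)} & \sigma^2
\end{array}$$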

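To find the expected value of $\mathrm{SS}(\alpha|\mu, \beta, \gamma) = \sum_i y_{i..}^2/bn - y_{...}^2/abn$ in (14.67), we
first note that by using assumption 1 in Section 14.1, we can write assumptions 2 and
3 in the form

$$E(\varepsilon_{ijk}^2) = \sigma^2 \quad \text{for all } i, j, k, \tag{14.82}$$

$$E(\varepsilon_{ijk}\varepsilon_{rst}) = 0 \quad \text{for all } (i, j, k) \neq (r, s, t). \tag{14.83}$$

Using these results, along with assumption 1 and the side conditions in (14.15), we
can show that $E(y_{...}^2) = a^2b^2n^2\mu^{*2} + abn\sigma^2$ as follows:

$$\begin{aligned}
E(y_{...}^2) &= E\Big(\sum_{ijk} y_{ijk}\Big)^2 = E\Big[\sum_{ijk} (\mu^* + \alpha_i^* + \beta_j^* + \gamma_{ij}^* + \varepsilon_{ijk})\Big]^2 \\
&= E\Big(abn\mu^* + bn\sum_i \alpha_i^* + an\sum_j \beta_j^* + n\sum_{ij} \gamma_{ij}^* + \sum_{ijk} \varepsilon_{ijk}\Big)^2 \\
&= E\Big[a^2b^2n^2\mu^{*2} + 2abn\mu^*\sum_{ijk} \varepsilon_{ijk} + \Big(\sum_{ijk} \varepsilon_{ijk}\Big)^2\Big] \\
&= a^2b^2n^2\mu^{*2} + E\Big(\sum_{ijk} \varepsilon_{ijk}^2\Big) + E\Big(\sum_{ijk \neq rst} \varepsilon_{ijk}\varepsilon_{rst}\Big) \\
&= a^2b^2n^2\mu^{*2} + abn\sigma^2.
\end{aligned}$$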


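It can likewise be shown that

$$E\Big(\sum_{i=1}^{a} y_{i..}^2\Big) = ab^2n^2\mu^{*2} + b^2n^2\sum_{i=1}^{a} \alpha_i^{*2} + abn\sigma^2 \tag{14.84}$$

(see Problem 14.27). Thus

$$\begin{aligned}
E\left[\frac{\mathrm{SS}(\alpha|\mu, \beta, \gamma)}{a - 1}\right] &= \frac{1}{a - 1}E\left(\sum_i \frac{y_{i..}^2}{bn} - \frac{y_{...}^2}{abn}\right) \\
&= \frac{1}{a - 1}\left[\frac{ab^2n^2\mu^{*2} + b^2n^2\sum_i \alpha_i^{*2} + abn\sigma^2}{bn} - \frac{a^2b^2n^2\mu^{*2} + abn\sigma^2}{abn}\right] \\
&= \frac{1}{a - 1}\Big[(a - 1)\sigma^2 + bn\sum_i \alpha_i^{*2}\Big] = \sigma^2 + bn\sum_i \frac{\alpha_i^{*2}}{a - 1}.
\end{aligned}$$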




The other expected mean squares in Table 14.5 can be obtained similarly (see 
Problem 14.28). 


14.5.2 Quadratic Form Approach 


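We now obtain the first expected mean square in Table 14.5 using a matrix approach.
We illustrate with $a = 3$, $b = 2$, and $n = 2$. By (14.75), we obtain

$$E[\mathrm{SS}(\alpha|\mu, \beta, \gamma)] = E\{(\mathbf{C}\hat{\boldsymbol{\beta}})'[\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{C}']^{-1}\mathbf{C}\hat{\boldsymbol{\beta}}\}. \tag{14.85}$$

The matrix $\mathbf{C}$ contains estimable functions, and therefore by (12.44) and (12.45), we
have $E(\mathbf{C}\hat{\boldsymbol{\beta}}) = \mathbf{C}\boldsymbol{\beta}$ and $\mathrm{cov}(\mathbf{C}\hat{\boldsymbol{\beta}}) = \sigma^2\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{C}'$. If we define $\mathbf{G}$ to be the $2 \times 2$
matrix $[\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{C}']^{-1}$, then by Theorem 5.2a, (14.85) becomes

$$\begin{aligned}
E[\mathrm{SS}(\alpha|\mu, \beta, \gamma)] &= E[(\mathbf{C}\hat{\boldsymbol{\beta}})'\mathbf{G}(\mathbf{C}\hat{\boldsymbol{\beta}})] \\
&= \mathrm{tr}[\mathbf{G}\,\mathrm{cov}(\mathbf{C}\hat{\boldsymbol{\beta}})] + [E(\mathbf{C}\hat{\boldsymbol{\beta}})]'\mathbf{G}[E(\mathbf{C}\hat{\boldsymbol{\beta}})] \\
&= \mathrm{tr}(\mathbf{G}\sigma^2\mathbf{G}^{-1}) + (\mathbf{C}\boldsymbol{\beta})'\mathbf{G}(\mathbf{C}\boldsymbol{\beta}) \\
&= 2\sigma^2 + \boldsymbol{\beta}'\mathbf{C}'[\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{C}']^{-1}\mathbf{C}\boldsymbol{\beta}
\end{aligned} \tag{14.86}$$

$$\phantom{E[\mathrm{SS}(\alpha|\mu, \beta, \gamma)]} = 2\sigma^2 + \boldsymbol{\beta}'\mathbf{L}\boldsymbol{\beta}, \tag{14.87}$$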


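where $\mathbf{L} = \mathbf{C}'[\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{C}']^{-1}\mathbf{C}$. Using $\mathbf{C}$ in (14.74) and $[\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{C}']^{-1}$ in (14.78), $\mathbf{L}$
becomes

$$\mathbf{L} = \frac13\begin{pmatrix}
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 8 & -4 & -4 & 0 & 0 & 4 & 4 & -2 & -2 & -2 & -2 \\
0 & -4 & 8 & -4 & 0 & 0 & -2 & -2 & 4 & 4 & -2 & -2 \\
0 & -4 & -4 & 8 & 0 & 0 & -2 & -2 & -2 & -2 & 4 & 4 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 4 & -2 & -2 & 0 & 0 & 2 & 2 & -1 & -1 & -1 & -1 \\
0 & 4 & -2 & -2 & 0 & 0 & 2 & 2 & -1 & -1 & -1 & -1 \\
0 & -2 & 4 & -2 & 0 & 0 & -1 & -1 & 2 & 2 & -1 & -1 \\
0 & -2 & 4 & -2 & 0 & 0 & -1 & -1 & 2 & 2 & -1 & -1 \\
0 & -2 & -2 & 4 & 0 & 0 & -1 & -1 & -1 & -1 & 2 & 2 \\
0 & -2 & -2 & 4 & 0 & 0 & -1 & -1 & -1 & -1 & 2 & 2
\end{pmatrix}. \tag{14.88}$$

This can be written as the difference

$$\mathbf{L} = \frac13\begin{pmatrix}
0 & \mathbf{0}' & \mathbf{0}' & \mathbf{0}' \\
\mathbf{0} & \mathbf{A}_{11} & \mathbf{O} & \mathbf{A}_{12} \\
\mathbf{0} & \mathbf{O} & \mathbf{O} & \mathbf{O} \\
\mathbf{0} & \mathbf{A}_{21} & \mathbf{O} & \mathbf{A}_{22}
\end{pmatrix} - \frac13\begin{pmatrix}
0 & \mathbf{0}' & \mathbf{0}' & \mathbf{0}' \\
\mathbf{0} & \mathbf{B}_{11} & \mathbf{O} & \mathbf{B}_{12} \\
\mathbf{0} & \mathbf{O} & \mathbf{O} & \mathbf{O} \\
\mathbf{0} & \mathbf{B}_{21} & \mathbf{O} & \mathbf{B}_{22}
\end{pmatrix}, \tag{14.89}$$

where the row and column blocks have sizes 1, 3, 2, and 6 (corresponding to $\mu$, the $\alpha$'s, the $\beta$'s, and the $\gamma$'s), and where
$\mathbf{A}_{11} = 12\mathbf{I}_3$, $\mathbf{B}_{11} = 4\mathbf{j}_3\mathbf{j}_3'$, $\mathbf{B}_{12} = 2\mathbf{j}_3\mathbf{j}_6'$, $\mathbf{B}_{21} = 2\mathbf{j}_6\mathbf{j}_3'$, $\mathbf{B}_{22} = \mathbf{j}_6\mathbf{j}_6'$,

$$\mathbf{A}_{12} = \begin{pmatrix} 6\mathbf{j}_2' & \mathbf{0}' & \mathbf{0}' \\ \mathbf{0}' & 6\mathbf{j}_2' & \mathbf{0}' \\ \mathbf{0}' & \mathbf{0}' & 6\mathbf{j}_2' \end{pmatrix}, \qquad \mathbf{A}_{21} = \mathbf{A}_{12}', \qquad \mathbf{A}_{22} = \begin{pmatrix} 3\mathbf{J}_2 & \mathbf{O} & \mathbf{O} \\ \mathbf{O} & 3\mathbf{J}_2 & \mathbf{O} \\ \mathbf{O} & \mathbf{O} & 3\mathbf{J}_2 \end{pmatrix}.$$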

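If we write $\boldsymbol{\beta}$ in (14.4) in the form

$$\boldsymbol{\beta} = (\mu, \boldsymbol{\alpha}', \beta_1, \beta_2, \boldsymbol{\gamma}')',$$

where $\boldsymbol{\alpha}' = (\alpha_1, \alpha_2, \alpha_3)$ and $\boldsymbol{\gamma}' = (\gamma_{11}, \gamma_{12}, \gamma_{21}, \gamma_{22}, \gamma_{31}, \gamma_{32})$, then $\boldsymbol{\beta}'\mathbf{L}\boldsymbol{\beta}$ in
(14.87) becomes

$$\begin{aligned}
\boldsymbol{\beta}'\mathbf{L}\boldsymbol{\beta} &= \tfrac13\boldsymbol{\alpha}'\mathbf{A}_{11}\boldsymbol{\alpha} + \tfrac13\boldsymbol{\alpha}'\mathbf{A}_{12}\boldsymbol{\gamma} + \tfrac13\boldsymbol{\gamma}'\mathbf{A}_{21}\boldsymbol{\alpha} + \tfrac13\boldsymbol{\gamma}'\mathbf{A}_{22}\boldsymbol{\gamma} - \tfrac13\boldsymbol{\alpha}'\mathbf{B}_{11}\boldsymbol{\alpha} \\
&\quad - \tfrac13\boldsymbol{\alpha}'\mathbf{B}_{12}\boldsymbol{\gamma} - \tfrac13\boldsymbol{\gamma}'\mathbf{B}_{21}\boldsymbol{\alpha} - \tfrac13\boldsymbol{\gamma}'\mathbf{B}_{22}\boldsymbol{\gamma}.
\end{aligned}$$

Since $\mathbf{A}_{21}' = \mathbf{A}_{12}$ and $\mathbf{B}_{21}' = \mathbf{B}_{12}$, this reduces to

$$\begin{aligned}
\boldsymbol{\beta}'\mathbf{L}\boldsymbol{\beta} &= \tfrac13\boldsymbol{\alpha}'\mathbf{A}_{11}\boldsymbol{\alpha} + \tfrac23\boldsymbol{\alpha}'\mathbf{A}_{12}\boldsymbol{\gamma} + \tfrac13\boldsymbol{\gamma}'\mathbf{A}_{22}\boldsymbol{\gamma} - \tfrac13\boldsymbol{\alpha}'\mathbf{B}_{11}\boldsymbol{\alpha} \\
&\quad - \tfrac23\boldsymbol{\alpha}'\mathbf{B}_{12}\boldsymbol{\gamma} - \tfrac13\boldsymbol{\gamma}'\mathbf{B}_{22}\boldsymbol{\gamma}.
\end{aligned}$$

If we partition $\boldsymbol{\gamma}$ as $\boldsymbol{\gamma}' = (\boldsymbol{\gamma}_1', \boldsymbol{\gamma}_2', \boldsymbol{\gamma}_3')$, where $\boldsymbol{\gamma}_i' = (\gamma_{i1}, \gamma_{i2})$, then

$$\tfrac23\boldsymbol{\alpha}'\mathbf{A}_{12}\boldsymbol{\gamma} = \tfrac23\boldsymbol{\alpha}'\begin{pmatrix} 6\mathbf{j}_2' & \mathbf{0}' & \mathbf{0}' \\ \mathbf{0}' & 6\mathbf{j}_2' & \mathbf{0}' \\ \mathbf{0}' & \mathbf{0}' & 6\mathbf{j}_2' \end{pmatrix}\begin{pmatrix} \boldsymbol{\gamma}_1 \\ \boldsymbol{\gamma}_2 \\ \boldsymbol{\gamma}_3 \end{pmatrix} = 4\boldsymbol{\alpha}'\begin{pmatrix} \gamma_{1.} \\ \gamma_{2.} \\ \gamma_{3.} \end{pmatrix} = 4\sum_{i=1}^{3} \alpha_i\gamma_{i.}.$$

Now, using the definitions of $\mathbf{A}_{11}$, $\mathbf{A}_{22}$, $\mathbf{B}_{11}$, $\mathbf{B}_{12}$, and $\mathbf{B}_{22}$ following (14.89), we
obtain

$$\begin{aligned}
\boldsymbol{\beta}'\mathbf{L}\boldsymbol{\beta} &= 4\boldsymbol{\alpha}'\boldsymbol{\alpha} + 4\sum_{i=1}^{3} \alpha_i\gamma_{i.} + \sum_{i=1}^{3} \gamma_{i.}^2 - \tfrac43\boldsymbol{\alpha}'\mathbf{j}_3\mathbf{j}_3'\boldsymbol{\alpha} - \tfrac43\boldsymbol{\alpha}'\mathbf{j}_3\mathbf{j}_6'\boldsymbol{\gamma} - \tfrac13\boldsymbol{\gamma}'\mathbf{j}_6\mathbf{j}_6'\boldsymbol{\gamma} \\
&= 4\sum_{i=1}^{3} \alpha_i^2 + 4\sum_{i=1}^{3} \alpha_i\gamma_{i.} + \sum_{i=1}^{3} \gamma_{i.}^2 - \tfrac43\alpha_{.}^2 - \tfrac43\alpha_{.}\gamma_{..} - \tfrac13\gamma_{..}^2.
\end{aligned} \tag{14.90}$$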

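By expressing $\gamma_{i.}$, $\alpha_.$, and $\gamma_{..}$ in terms of means, (14.90) can be written in the form

$$
\boldsymbol{\beta}'\mathbf{L}\boldsymbol{\beta} = 4\sum_{i=1}^{3}(\alpha_i - \bar{\alpha}_. + \bar{\gamma}_{i.} - \bar{\gamma}_{..})^2 = 4\sum_{i=1}^{3}\alpha_i^{*2} \quad \text{[by (14.16)]}. \tag{14.91}
$$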
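As a numerical check on the algebra leading from (14.89) to (14.91), the following minimal sketch assembles $3\mathbf{L}$ from the blocks $\mathbf{A}_{11} - \mathbf{B}_{11}$, $\mathbf{A}_{12} - \mathbf{B}_{12}$, and $\mathbf{A}_{22} - \mathbf{B}_{22}$ and compares the quadratic form with the closed form $4\sum_i \alpha_i^{*2}$. The parameter vector is an arbitrary illustrative choice, and the helper names (`build_3L`, `quad_form`, `closed_form`) are hypothetical; exact rational arithmetic is used so that the comparison is not blurred by rounding.

```python
from fractions import Fraction as Fr

# Parameter order: mu, alpha_1..alpha_3, beta_1, beta_2, gamma_11, gamma_12, ..., gamma_32.

def build_3L():
    """Return 3*L as a 12x12 integer matrix assembled from the A and B blocks of (14.89)."""
    M = [[0] * 12 for _ in range(12)]
    alpha_idx = [1, 2, 3]
    gamma_idx = {i: [6 + 2 * i, 7 + 2 * i] for i in range(3)}  # positions of gamma_i1, gamma_i2
    for i in range(3):
        for k in range(3):
            same = (i == k)
            # A11 - B11 = 12*I_3 - 4*j_3 j_3'
            M[alpha_idx[i]][alpha_idx[k]] = (12 if same else 0) - 4
            for g in gamma_idx[k]:
                # A12 - B12 = 6*blockdiag(j_2') - 2*j_3 j_6'  (and its transpose)
                v = (6 if same else 0) - 2
                M[alpha_idx[i]][g] = v
                M[g][alpha_idx[i]] = v
            for g in gamma_idx[i]:
                for h in gamma_idx[k]:
                    # A22 - B22 = 3*blockdiag(j_2 j_2') - j_6 j_6'
                    M[g][h] = (3 if same else 0) - 1
    return M

def quad_form(beta):
    """beta' L beta, computed from the explicit matrix."""
    M = build_3L()
    return sum(Fr(M[r][c]) * beta[r] * beta[c] for r in range(12) for c in range(12)) / 3

def closed_form(beta):
    """4 * sum_i (alpha_i - alpha_bar + gamma_bar_i. - gamma_bar_..)^2, as in (14.91)."""
    alpha = beta[1:4]
    gamma = [beta[6 + 2 * i: 8 + 2 * i] for i in range(3)]
    alpha_bar = sum(alpha) / 3
    gamma_bar_i = [sum(g) / 2 for g in gamma]
    gamma_bar = sum(gamma_bar_i) / 3
    return 4 * sum((alpha[i] - alpha_bar + gamma_bar_i[i] - gamma_bar) ** 2 for i in range(3))

# Arbitrary illustrative parameter values (not taken from the text).
beta = [Fr(v) for v in (7, 1, -2, 4, 3, -5, 2, 0, -1, 6, 3, -4)]
print(quad_form(beta), closed_form(beta))
```

Both expressions agree for this choice of $\boldsymbol{\beta}$; note that $\mu$, $\beta_1$, and $\beta_2$ do not affect the result, since the corresponding rows and columns of $\mathbf{L}$ are zero.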


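For an alternative approach leading to (14.91), note that since $E(\mathbf{C}\hat{\boldsymbol{\beta}}) = \mathbf{C}\boldsymbol{\beta}$, (14.86) can be written as

$$
E[\mathrm{SS}(\alpha \mid \mu, \beta, \gamma)] = 2\sigma^2 + [E(\mathbf{C}\hat{\boldsymbol{\beta}})]'[\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{C}']^{-1}E(\mathbf{C}\hat{\boldsymbol{\beta}}). \tag{14.92}
$$

By (14.75), $\mathrm{SS}(\alpha \mid \mu, \beta, \gamma) = \mathrm{SSH} = (\mathbf{C}\hat{\boldsymbol{\beta}})'[\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{C}']^{-1}\mathbf{C}\hat{\boldsymbol{\beta}}$. Thus, by (14.92), we can obtain $E[\mathrm{SS}(\alpha \mid \mu, \beta, \gamma)]$ by replacing $\mathbf{C}\hat{\boldsymbol{\beta}}$ in $\mathrm{SS}(\alpha \mid \mu, \beta, \gamma)$ with $\mathbf{C}\boldsymbol{\beta}$ and adding $2\sigma^2$. To illustrate, we replace $\bar{y}_{i..}$ and $\bar{y}_{...}$ with $E(\bar{y}_{i..})$ and $E(\bar{y}_{...})$ in $\mathrm{SS}(\alpha \mid \mu, \beta, \gamma) = 4\sum_{i=1}^{3}(\bar{y}_{i..} - \bar{y}_{...})^2$ in (14.69). We first find $E(\bar{y}_{i..})$: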


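$$
\begin{aligned}
E(\bar{y}_{i..}) &= E\Big(\tfrac{1}{4}\sum_{jk} y_{ijk}\Big) = \tfrac{1}{4}\sum_{jk} E(y_{ijk}) \\
&= \tfrac{1}{4}\sum_{jk} E(\mu + \alpha_i + \beta_j + \gamma_{ij} + \varepsilon_{ijk}) \\
&= \tfrac{1}{4}\sum_{jk} (\mu + \alpha_i + \beta_j + \gamma_{ij}) \\
&= \tfrac{1}{4}\Big(4\mu + 4\alpha_i + 2\sum_j \beta_j + 2\sum_j \gamma_{ij}\Big) \\
&= \mu + \alpha_i + \bar{\beta}_. + \bar{\gamma}_{i.}.
\end{aligned} \tag{14.93}
$$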


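Similarly

$$
E(\bar{y}_{...}) = \mu + \bar{\alpha}_. + \bar{\beta}_. + \bar{\gamma}_{..}. \tag{14.94}
$$

Then,

$$
\begin{aligned}
E[\mathrm{SS}(\alpha \mid \mu, \beta, \gamma)] &= 2\sigma^2 + 4\sum_{i=1}^{3}[E(\bar{y}_{i..}) - E(\bar{y}_{...})]^2 \\
&= 2\sigma^2 + 4\sum_{i=1}^{3}(\mu + \alpha_i + \bar{\beta}_. + \bar{\gamma}_{i.} - \mu - \bar{\alpha}_. - \bar{\beta}_. - \bar{\gamma}_{..})^2 \\
&= 2\sigma^2 + 4\sum_{i=1}^{3}(\alpha_i - \bar{\alpha}_. + \bar{\gamma}_{i.} - \bar{\gamma}_{..})^2 \\
&= 2\sigma^2 + 4\sum_{i=1}^{3}\alpha_i^{*2} \quad \text{[by (14.16)]}.
\end{aligned}
$$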




PROBLEMS 
14.1. Obtain 6, and 65 in (14.7) from (14.6). 
14.2 In a comment following (14.8), it is noted that there are no estimable con- 
trasts in the a’s alone or B’s alone. Verify this statement. 
14.3 Show that +(6; + 65 + 63') has the value shown in (14.11). 
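14.4 Verify the following results in (14.15) using the definitions of $\alpha_i^*$, $\beta_j^*$, and $\gamma_{ij}^*$ in (14.14):
(a) $\sum_i \alpha_i^* = 0$
(b) $\sum_j \beta_j^* = 0$
(c) $\sum_i \gamma_{ij}^* = 0$, $j = 1, 2, \ldots, b$
(d) $\sum_j \gamma_{ij}^* = 0$, $i = 1, 2, \ldots, a$
14.5 Verify the following results from (14.15) using the definitions of $\alpha_i^*$, $\beta_j^*$, and $\gamma_{ij}^*$ in (14.16), (14.17), and (14.18):
(a) $\sum_i \alpha_i^* = 0$
(b) $\sum_j \beta_j^* = 0$
(c) $\sum_i \gamma_{ij}^* = 0$, $j = 1, 2, \ldots, b$
(d) $\sum_j \gamma_{ij}^* = 0$, $i = 1, 2, \ldots, a$
14.6 (a) Show that $\beta_j^* = \beta_j - \bar{\beta}_. + \bar{\gamma}_{.j} - \bar{\gamma}_{..}$, as in (14.17).
(b) Show that $\gamma_{ij}^* = \gamma_{ij} - \bar{\gamma}_{i.} - \bar{\gamma}_{.j} + \bar{\gamma}_{..}$, as in (14.18).
14.7 Show that $\hat{\alpha}_i$ and $\hat{\gamma}_{ij}$ in (14.21) are unbiased estimators of $\alpha_i^*$ and $\gamma_{ij}^*$, as noted following (14.21).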
14.8 (a) Show that y,, + ¥,. = 2y,_ and that ¥,, +}. = 2y,_, as used to obtain 
(14.22). 
(b) Show that @ — @& +5 (¥1 + V2) — Qo + ¥22) =i, — Fn, aS in 
(14.22). 
14.9 Show that (X’X)~ in (14.23) is a generalized inverse of X’X in (14.5). 
14.10 Show that SSE in (14.26) is equal to SSE in (14.25). 
14.11 Show that the second equality in (14.34) is equivalent to the second equality 
in (14.35); that is, py; — Moy = M3, — M32 ~Implies  y2; — Y2— 
Yai + ¥32 = 0. 
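14.12 Show that $bn\sum_i(\bar{y}_{i..} - \bar{y}_{...})^2 = \sum_i y_{i..}^2/bn - y_{...}^2/abn$ and that $n\sum_{ij}(\bar{y}_{ij.} - \bar{y}_{i..} - \bar{y}_{.j.} + \bar{y}_{...})^2 = \sum_{ij} y_{ij.}^2/n - \sum_i y_{i..}^2/bn - \sum_j y_{.j.}^2/an + y_{...}^2/abn$, as in (14.40).
14.13 (a) In a comment following (14.41), it was noted that the use of $\hat{\boldsymbol{\beta}}$ from (14.24) would produce the same result as in (14.41), namely, $\hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} = \sum_{ij} y_{ij.}^2/n$. Verify this.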


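(b) Show that $\mathrm{SSE} = \sum_{ijk} y_{ijk}^2 - n\sum_{ij}\bar{y}_{ij.}^2$ in (14.25) is equal to $\mathrm{SSE} = \sum_{ijk} y_{ijk}^2 - \sum_{ij} y_{ij.}^2/n$ in (14.46).

14.14 Show that $(a - 1)(b - 1)$ is the number of independent $\gamma_{ij}^*$ terms in $H_0: \gamma_{ij}^* = 0$ for $i = 1, 2, \ldots, a$ and $j = 1, 2, \ldots, b$, as noted near the end of Section 14.4.1.2.

14.15 Show that $\mathrm{SS}(\gamma \mid \mu, \alpha, \beta) = n\sum_{ij}(\bar{y}_{ij.} - \bar{y}_{i..} - \bar{y}_{.j.} + \bar{y}_{...})^2$ in (14.47) is the same as $\mathrm{SS}(\gamma \mid \mu, \alpha, \beta)$ in (14.45).

14.16 Show that $\mathrm{SSE} = \sum_{ijk}(y_{ijk} - \bar{y}_{ij.})^2$ in (14.48) is equal to $\mathrm{SSE} = \sum_{ijk} y_{ijk}^2 - \sum_{ij} y_{ij.}^2/n$ in (14.46).

14.17 Using $\mathbf{X}'\mathbf{X}$ in (14.5) and $(\mathbf{X}'\mathbf{X})^{-}$ in (14.23), show that $\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'$ has the form given in (14.50).

14.18 (a) Show that $(\mathbf{X}_1'\mathbf{X}_1)^{-}$ in (14.53) is a generalized inverse of $\mathbf{X}_1'\mathbf{X}_1$ in (14.42).
(b) Show that $\mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-}\mathbf{X}_1'$ has the form given by (14.54).

14.19 Show that $\sum_i y_{i..}^2$ can be written in the matrix form given in (14.59).

14.20 Show that $\frac{1}{2}\mathbf{A} - \frac{1}{4}\mathbf{B} - \frac{1}{6}\mathbf{C} + \frac{1}{12}\mathbf{D}$ has the value shown in (14.61).

14.21 Show that $H_0: \alpha_1^* = \alpha_2^* = \alpha_3^*$ in (14.63) is equivalent to $H_0: \alpha_1^* = \alpha_2^* = \alpha_3^* = 0$ in (14.64).

14.22 Obtain $\mathrm{SS}(\mu, \alpha, \gamma)$ and show that $\mathrm{SS}(\beta \mid \mu, \alpha, \gamma) = \sum_j y_{.j.}^2/an - y_{...}^2/abn$, as in (14.68).

14.23 Prove Theorem 14.4b for the special case $a = 3$, $b = 2$, and $n = 2$.

14.24 (a) Using $\mathbf{C}$ in (14.74), $(\mathbf{X}'\mathbf{X})^{-}$ in (14.23), and $\mathbf{X}$ in (14.4), show that $\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'$ is the $2 \times 12$ matrix given in (14.77).
(b) Using $\mathbf{C}$ in (14.74) and $(\mathbf{X}'\mathbf{X})^{-}$ in (14.23), show that $\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{C}'$ is the $2 \times 2$ matrix shown in (14.78).
(c) Show that the matrix $\mathbf{A} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{C}'[\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{C}']^{-1}\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'$ has the form shown in (14.79).

14.25 For the B main effect, formulate a hypothesis $H_0: \mathbf{C}\boldsymbol{\beta} = \mathbf{0}$ and obtain $\mathrm{SS}(\beta \mid \mu, \alpha, \gamma)$ using SSH in (14.75).

14.26 Using assumptions 1, 2, and 3 in Section 14.1, show that $E(\varepsilon_{ijk}^2) = \sigma^2$ for all $i, j, k$ and $E(\varepsilon_{ijk}\varepsilon_{rst}) = 0$ for $(i, j, k) \neq (r, s, t)$, as in (14.82) and (14.82).

14.27 Show that $E\big(\sum_{i=1}^{a} y_{i..}^2\big) = ab^2n^2\mu^2 + b^2n^2\sum_{i=1}^{a}\alpha_i^{*2} + abn\sigma^2$, as in (14.84).

14.28 (a) Show that $E\big(\sum_j y_{.j.}^2\big) = a^2bn^2\mu^2 + a^2n^2\sum_j\beta_j^{*2} + abn\sigma^2$.
(b) Show that $E\big(\sum_{ij} y_{ij.}^2\big) = abn^2\mu^2 + bn^2\sum_i\alpha_i^{*2} + an^2\sum_j\beta_j^{*2} + n^2\sum_{ij}\gamma_{ij}^{*2} + abn\sigma^2$.
(c) Show that $E[\mathrm{SS}(\beta \mid \mu, \alpha, \gamma)/(b - 1)] = \sigma^2 + an\sum_j\beta_j^{*2}/(b - 1)$.
(d) Show that $E\{\mathrm{SS}(\gamma \mid \mu, \alpha, \beta)/[(a - 1)(b - 1)]\} = \sigma^2 + n\sum_{ij}\gamma_{ij}^{*2}/[(a - 1)(b - 1)]$.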


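TABLE 14.6 Lactic Acid(a) at Five Successive Time Periods for Fresh and Wilted Alfalfa Silage

                      Period
Condition     1      2      3      4      5
Fresh       13.4   37.5   65.2   60.8   37.7
            16.0   42.7   54.9   57.1   49.2
Wilted      14.4   29.3   36.4   39.1   39.4
            20.0   34.5   39.7   38.7   39.7

(a) In mg/g of silage.

14.29 Using $\mathbf{C}$ in (14.74) and $(\mathbf{X}'\mathbf{X})^{-}$ in (14.23), show that $\mathbf{L} = \mathbf{C}'[\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{C}']^{-1}\mathbf{C}$ has the form shown in (14.88).

14.30 Expand $4\sum_{i=1}^{3}(\alpha_i - \bar{\alpha}_. + \bar{\gamma}_{i.} - \bar{\gamma}_{..})^2$ in (14.91) to obtain (14.90).

14.31 (a) Show that $E(\bar{y}_{...}) = \mu + \bar{\alpha}_. + \bar{\beta}_. + \bar{\gamma}_{..}$, as in (14.94).
(b) Show that $E(\bar{y}_{.j.}) = \mu + \bar{\alpha}_. + \beta_j + \bar{\gamma}_{.j}$.
(c) Show that $E(\bar{y}_{ij.}) = \mu + \alpha_i + \beta_j + \gamma_{ij}$.

14.32 Obtain the following expected values using the method suggested by (14.92) and illustrated at the end of Section 14.5.2. Use the results of Problem 14.31b, c.
(a) $E[\mathrm{SS}(\beta \mid \mu, \alpha, \gamma)] = \sigma^2 + 6\sum_j\beta_j^{*2}$
(b) $E[\mathrm{SS}(\gamma \mid \mu, \alpha, \beta)] = 2\sigma^2 + 2\sum_{ij}\gamma_{ij}^{*2}$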


TABLE 14.7 Hemoglobin Concentration (g/mL) in Blood of Brown Trout* 


Rate: 1 2 3 4 

Method: A B A B A B A B 
6.7 7.0 9.9 9.9 10.4 9.9 9.3 11.0 
7.8 7.8 8.4 9.6 8.1 9.6 9.3 9.3 
5.5 6.8 10.4 10.2 10.6 10.4 7.8 11.0 
8.4 7.0 9.3 10.4 8.7 10.4 7.8 9.0 
7.0 9 10.7 11.3 10.7 11.3 9.3 8.4 
7.8 6.5 11.9 9.1 9.1 10.9 10.2 8.4 
8.6 5.8 TA 9.0 8.8 8.0 8.7 6.8 
74 7A 6.4 10.6 8.1 10.2 8.6 7.2 
5.8 6.5 8.6 11.7 7.8 6.1 9.3 8.1 
7.0 535 10.6 9.6 8.0 10.7 7.2 11.0 


“After 35 days of treatment at the daily rates of 0, 5, 10, and 15g of sulfamerazine per 100 Ib of fish 
employing two methods for each rate. 


14.33 A preservative was added to fresh and wilted alfalfa silage (Snedecor 1948). The lactic acid concentration was measured at five periods after ensiling began. There were two replications. The results are given in Table 14.6. Let factor A be condition (fresh or wilted) and factor B be period. Test for main effects and interactions.

14.34 Gutsell (1951) measured hemoglobin in the blood of brown trout after treatment with four rates of sulfamerazine. Two methods of administering the sulfamerazine were used. Ten fish were measured for each rate and each method. The data are given in Table 14.7. Test for effect of rate and method and interaction.


15 Analysis-of-Variance: The
Cell Means Model for
Unbalanced Data


15.1 INTRODUCTION 


The theory of linear models for ANOVA applications was developed in Chapter 12. 
Although all the examples used in that and the following chapters have involved 
balanced data (where the number of observations is equal from one cell to 
another), the theory also applies to unbalanced data. 

Chapters 13 and 14 show that simple and intuitive results are obtained when the 
theory is applied to balanced ANOVA situations. Intuitive marginal means are 
informative in analysis of the data [e.g., see (14.69) and (14.70)]. When applied to 
unbalanced data, however, the general results of Chapter 12 do not simplify to intui- 
tive formulas. Even worse, the intuitive marginal means one is tempted to use can be 
misleading and sometimes paradoxical. This is especially true for two-way or higher- 
way data. As an example, consider the unbalanced two-way data in Figure 15.1. The 
data follow the two-way additive model (Section 12.1.2) with no error 


$$y_{ij} = \mu + \alpha_i + \beta_j, \quad i = 1, 2, \quad j = 1, 2,$$

where $\mu = 25$, $\alpha_1 = 0$, $\alpha_2 = -20$, $\beta_1 = 0$, $\beta_2 = 5$. Simple marginal means of the
data are given to the right and below the box.

The true effects of factors A and B are, respectively, $\alpha_2 - \alpha_1 = -20$ and $\beta_2 - \beta_1 = 5$. Even for error-free unbalanced data, however, naive estimates of these
effects based on the simple marginal means are highly misleading. With three observations of 25 and one of 30 in the first row, the first-row mean is $(25 + 25 + 25 + 30)/4 = 26.25$, so the effect of
factor A appears to be $8.75 - 26.25 = -17.5$, and even more surprisingly the
effect of factor B appears to be $15 - 20 = -5$.
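The contrast between simple marginal means and averages of cell means can be sketched in a few lines. The cell contents below are transcribed from Figure 15.1 (three observations of 25, one of 30, one of 5, and three of 10); the helper names `marginal_mean` and `unweighted_mean` are hypothetical, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction as Fr

# Error-free cell contents read off Figure 15.1; keys are (level of A, level of B).
cells = {
    (1, 1): [25, 25, 25],
    (1, 2): [30],
    (2, 1): [5],
    (2, 2): [10, 10, 10],
}

def mean(values):
    return Fr(sum(values), len(values))

def marginal_mean(axis, level):
    """Simple marginal mean: pool every observation at this level (axis 0 = A, axis 1 = B)."""
    pooled = [y for key, ys in cells.items() if key[axis] == level for y in ys]
    return mean(pooled)

def unweighted_mean(axis, level):
    """Average of the cell means at this level, giving each cell equal weight."""
    cell_means = [mean(ys) for key, ys in cells.items() if key[axis] == level]
    return mean(cell_means)

naive_A = marginal_mean(0, 2) - marginal_mean(0, 1)      # based on simple row means
naive_B = marginal_mean(1, 2) - marginal_mean(1, 1)      # based on simple column means
cell_A = unweighted_mean(0, 2) - unweighted_mean(0, 1)   # based on averaged cell means
cell_B = unweighted_mean(1, 2) - unweighted_mean(1, 1)

print(float(naive_A), float(naive_B))   # naive effects from simple marginal means
print(float(cell_A), float(cell_B))     # effects from equally weighted cell means
```

In this error-free layout the equally weighted cell means recover the true effects $-20$ and $5$, whereas the simple marginal means do not, which is one motivation for working with cell means in what follows.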

Still other complications arise in the analysis of unbalanced data. For example, it 
was mentioned in Section 14.4.2.1 that many texts discourage testing for main effects 
in the presence of interactions. But little harm or controversy results from doing so 
when the data are balanced. The numerators for the main effect F tests are exactly 


Linear Models in Statistics, Second Edition, by Alvin C. Rencher and G. Bruce Schaalje 
Copyright © 2008 John Wiley & Sons, Inc. 




                      B
               1              2         Mean
A    1    25, 25, 25         30        26.250
     2         5         10, 10, 10     8.750
   Mean       20             15

Figure 15.1 Hypothetical error-free data from an unbalanced two-way model.


the same whether the model with or without interactions is being entertained as the 
full model. Such is not the case for unbalanced data. The numerator sums of 
squares in these F tests depend greatly on which model is used as the full model, 
and, obviously, conclusions can be affected. Several types of sums of squares 
[usually types I, II, and III; see Milliken and Johnson (1984, pp. 138-158)] have
been suggested to help clarify this issue. 

The issues involved in choosing the appropriate full model for a test are subtle and 
often confusing. The use of different full models results in different weightings in the 
sums of squares calculations and expected mean squares. But some of the same 
weightings also arise for other reasons. For example, the weights might arise 
because the data are based on “probability proportional to size” (pps) sampling of 
populations (Cochran 1977, pp. 250-251). 

Looking at this complex issue from different points of view has led to completely 
contradictory conclusions. For example, Milliken and Johnson (1984, p. 158) wrote 
that “in almost all cases, type III sums of squares will be preferred,” whereas Nelder
and Lane (1995) saw “no place for types III and IV sums of squares in making infer- 
ences from the use of linear models.” 

Further confusion regarding the analysis of unbalanced data has arisen from the 
interaction of computing advances with statistical practice. Historically, several 
different methods for unbalanced data analysis were developed as approximate 
methods, suitable for the computing resources available at the time. Looking back, 
however, we simply see a confusing array of alternative methods. Some such 
methods include weighted squares of means (Yates 1934; Morrison 1983, pp. 407-412),
the method of unweighted means (Searle 1971; Winer 1971), the method of
fitting constants (Rao 1965, pp. 211-214; Searle 1971, p. 139; Snedecor and 
Cochran 1967), and various methods of imputing data to make the dataset balanced 
(Hartley 1956; Healy and Westmacott 1969; Little and Rubin 2002, pp. 28-30). 

The overparameterized (non-full rank) model (Sections 12.2, 12.5, 13.1, and 14.1) 
has some advantages in the analysis of unbalanced data, while the cell means 
approach (Section 12.1.1) has other advantages. The non-full rank approach builds 
the structure (additive two-way, full two-way, etc.) of the dataset into the model 
from the start, but relies on the subtle concepts of estimability, testability, and 




generalized inverses. The cell means model has the advantages of being a full-rank 
model, but the structure of the dataset is not an explicit part of the model. 
Whichever model is used, hard questions about the exact hypotheses of interest 
have to be faced. Many of the complexities are a matter of statistical practice rather 
than mathematical statistics. 

The most extreme form of imbalance is that in which one or more of the cells have 
no observations. In this “empty cells” situation, even the cell means model is an over- 
parameterized model. Nonetheless, the cell means approach allows one to deal 
specifically with nonestimability problems arising from the empty cells. Such an 
approach is almost impossible using the overparameterized approach. 

In the remainder of this chapter we discuss the analysis of unbalanced data using 
the cell means model. Unbalanced one-way and two-way models are covered in 
Sections 15.2 and 15.3. In Section 15.4 we discuss the empty-cell situation. 


15.2 ONE-WAY MODEL 


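The non-full-rank and cell means versions of the one-way unbalanced model are

$$y_{ij} = \mu + \alpha_i + \varepsilon_{ij} \tag{15.1}$$

$$\phantom{y_{ij}} = \mu_i + \varepsilon_{ij}, \tag{15.2}$$

$$i = 1, 2, \ldots, k, \quad j = 1, 2, \ldots, n_i.$$

For making inferences, we assume the $\varepsilon_{ij}$'s are independently distributed as $N(0, \sigma^2)$.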


15.2.1 Estimation and Testing 


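To estimate the $\mu_i$'s, we begin by writing the $N = \sum_i n_i$ observations for the model
(15.2) in the form

$$\mathbf{y} = \mathbf{W}\boldsymbol{\mu} + \boldsymbol{\varepsilon}, \tag{15.3}$$

where

$$
\mathbf{W} = \begin{pmatrix}
1 & 0 & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
1 & 0 & \cdots & 0 \\
0 & 1 & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
0 & 1 & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
0 & 0 & \cdots & 1 \\
\vdots & \vdots & & \vdots \\
0 & 0 & \cdots & 1
\end{pmatrix}, \qquad
\boldsymbol{\mu} = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_k \end{pmatrix}.
$$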




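The normal equations are given by

$$\mathbf{W}'\mathbf{W}\hat{\boldsymbol{\mu}} = \mathbf{W}'\mathbf{y},$$

where $\mathbf{W}'\mathbf{W} = \operatorname{diag}(n_1, n_2, \ldots, n_k)$ and $\mathbf{W}'\mathbf{y} = (y_{1.}, y_{2.}, \ldots, y_{k.})'$, with $y_{i.} = \sum_{j=1}^{n_i} y_{ij}$. Since the matrix $\mathbf{W}$ is full rank, we have, by (7.6),

$$\hat{\boldsymbol{\mu}} = (\mathbf{W}'\mathbf{W})^{-1}\mathbf{W}'\mathbf{y} \tag{15.4}$$

$$\phantom{\hat{\boldsymbol{\mu}}} = \bar{\mathbf{y}} = \begin{pmatrix} \bar{y}_{1.} \\ \bar{y}_{2.} \\ \vdots \\ \bar{y}_{k.} \end{pmatrix}, \tag{15.5}$$

where $\bar{y}_{i.} = \sum_{j=1}^{n_i} y_{ij}/n_i$.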
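To test $H_0: \mu_1 = \mu_2 = \cdots = \mu_k$, we compare the full model in (15.2) and (15.3)
with the reduced model $y_{ij} = \mu + \varepsilon_{ij}^*$, where $\mu$ is the common value of $\mu_1, \mu_2, \ldots, \mu_k$
under $H_0$. (We do not use the notation $\mu^*$ in the reduced model because there is no $\mu$
in the full model $y_{ij} = \mu_i + \varepsilon_{ij}$.) In matrix form, the $N$ observations in the reduced
model become $\mathbf{y} = \mu\mathbf{j} + \boldsymbol{\varepsilon}^*$, where $\mathbf{j}$ is $N \times 1$. For the full model, we have $\mathrm{SS}(\mu_1, \mu_2, \ldots, \mu_k) = \hat{\boldsymbol{\mu}}'\mathbf{W}'\mathbf{y}$, and for the reduced model, we have $\mathrm{SS}(\mu) = \hat{\mu}\mathbf{j}'\mathbf{y} = N\bar{y}_{..}^2$,
where $N = \sum_i n_i$ and $\bar{y}_{..} = \sum_{ij} y_{ij}/N$. The difference $\mathrm{SS}(\mu_1, \mu_2, \ldots, \mu_k) - \mathrm{SS}(\mu)$
is equal to the regression sum of squares SSR in (8.6), which we denote by SSB
for “between” sum of squares:

$$\mathrm{SSB} = \hat{\boldsymbol{\mu}}'\mathbf{W}'\mathbf{y} - N\bar{y}_{..}^2 = \sum_{i=1}^{k}\bar{y}_{i.}y_{i.} - N\bar{y}_{..}^2 \tag{15.6}$$

$$\phantom{\mathrm{SSB}} = \sum_{i=1}^{k}\frac{y_{i.}^2}{n_i} - \frac{y_{..}^2}{N}, \tag{15.7}$$

where $y_{..} = \sum_{ij} y_{ij}$ and $\bar{y}_{..} = y_{..}/N$. From (15.7), we see that SSB has $k - 1$ degrees
of freedom. The error sum of squares is given by (7.24) or (8.6) as

$$\mathrm{SSE} = \mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\mu}}'\mathbf{W}'\mathbf{y} = \sum_{i=1}^{k}\sum_{j=1}^{n_i} y_{ij}^2 - \sum_{i=1}^{k}\frac{y_{i.}^2}{n_i}, \tag{15.8}$$

which has $N - k$ degrees of freedom. These sums of squares are summarized in
Table 15.1.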

TABLE 15.1 One-Way Unbalanced ANOVA

Source     Sum of Squares                                          df
Between    SSB = $\sum_i y_{i.}^2/n_i - y_{..}^2/N$                $k - 1$
Error      SSE = $\sum_{ij} y_{ij}^2 - \sum_i y_{i.}^2/n_i$        $N - k$
Total      SST = $\sum_{ij} y_{ij}^2 - y_{..}^2/N$                 $N - 1$


The sums of squares SSB and SSE in Table 15.1 can also be written in the form

$$\mathrm{SSB} = \sum_{i=1}^{k} n_i(\bar{y}_{i.} - \bar{y}_{..})^2, \tag{15.9}$$

$$\mathrm{SSE} = \sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij} - \bar{y}_{i.})^2. \tag{15.10}$$

If we assume that the $y_{ij}$'s are independently distributed as $N(\mu_i, \sigma^2)$, then by
Theorem 8.1d, an $F$ statistic for testing $H_0: \mu_1 = \mu_2 = \cdots = \mu_k$ is given by

$$F = \frac{\mathrm{SSB}/(k - 1)}{\mathrm{SSE}/(N - k)}. \tag{15.11}$$

If $H_0$ is true, $F$ is distributed as $F(k - 1, N - k)$.


Example 15.2.1. A sample from the output of five filling machines is given in 
Table 15.2 (Ostle and Mensing 1975, p. 359). 

The analysis of variance is given in Table 15.3. The F is calculated by 
(15.11). There is no significant difference in the average weights filled by the five 
machines. 
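The calculations behind Table 15.3 can be sketched directly from (15.9), (15.10), and (15.11). In the block below, the data are transcribed from Table 15.2; the fourth observations for machines A and E (12.10 and 12.02) are an assumption, inferred from the group sizes $n_i = 4, 2, 3, 3, 4$ used later in the text and from the reported sums of squares. The function name `one_way_anova` is hypothetical.

```python
# Net weights from Table 15.2. The last values for machines A and E are an
# assumption (inferred from the group sizes and the reported sums of squares).
fill = {
    "A": [11.95, 12.00, 12.25, 12.10],
    "B": [12.18, 12.11],
    "C": [12.16, 12.15, 12.08],
    "D": [12.25, 12.30, 12.10],
    "E": [12.10, 12.04, 12.02, 12.02],
}

def one_way_anova(groups):
    """Unbalanced one-way ANOVA using the deviation forms (15.9) and (15.10)."""
    all_obs = [y for ys in groups.values() for y in ys]
    N, k = len(all_obs), len(groups)
    grand_mean = sum(all_obs) / N
    means = {g: sum(ys) / len(ys) for g, ys in groups.items()}
    # Between sum of squares: group sizes weight the squared deviations of group means.
    ssb = sum(len(ys) * (means[g] - grand_mean) ** 2 for g, ys in groups.items())
    # Error sum of squares: deviations of observations from their own group mean.
    sse = sum((y - means[g]) ** 2 for g, ys in groups.items() for y in ys)
    f_stat = (ssb / (k - 1)) / (sse / (N - k))   # F statistic in (15.11)
    return {"SSB": ssb, "SSE": sse, "df_between": k - 1, "df_error": N - k, "F": f_stat}

result = one_way_anova(fill)
print({key: round(val, 5) for key, val in result.items()})
```

Under these assumptions the sketch gives sums of squares and an $F$ value that agree with Table 15.3 to the reported precision.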


15.2.2 Contrasts


A contrast in the population means is defined as $\delta = c_1\mu_1 + c_2\mu_2 + \cdots + c_k\mu_k$,
where $\sum_{i=1}^{k} c_i = 0$. The contrast can be expressed as $\delta = \mathbf{c}'\boldsymbol{\mu}$, where

TABLE 15.2 Net Weight of Cans Filled by Five
Machines (A-E)

  A       B       C       D       E
11.95   12.18   12.16   12.25   12.10
12.00   12.11   12.15   12.30   12.04
12.25           12.08   12.10   12.02
12.10                           12.02


TABLE 15.3 ANOVA for the Fill Data in Table 15.2

Source     df   Sum of Squares   Mean Square   F        p Value
Between     4   .05943           .01486        1.9291   .176
Error      11   .08472           .00770
Total      15   .14414


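$\mathbf{c} = (c_1, c_2, \ldots, c_k)'$ and $\boldsymbol{\mu} = (\mu_1, \mu_2, \ldots, \mu_k)'$. The best linear unbiased estimator
of $\delta$ is given by $\hat{\delta} = c_1\bar{y}_{1.} + c_2\bar{y}_{2.} + \cdots + c_k\bar{y}_{k.} = \mathbf{c}'\hat{\boldsymbol{\mu}}$ [see (15.5) and Corollary 1
to Theorem 7.3d]. By (3.42), $\operatorname{var}(\hat{\delta}) = \sigma^2\mathbf{c}'(\mathbf{W}'\mathbf{W})^{-1}\mathbf{c}$, which can be written as
$\operatorname{var}(\hat{\delta}) = \sigma^2\sum_{i=1}^{k} c_i^2/n_i$, since $\mathbf{W}'\mathbf{W} = \operatorname{diag}(n_1, n_2, \ldots, n_k)$. By (8.38), the $F$ statistic
for testing $H_0: \delta = 0$ is

$$F = \frac{(\mathbf{c}'\hat{\boldsymbol{\mu}})'[\mathbf{c}'(\mathbf{W}'\mathbf{W})^{-1}\mathbf{c}]^{-1}\mathbf{c}'\hat{\boldsymbol{\mu}}}{s^2} \tag{15.12}$$

$$\phantom{F} = \frac{\left(\sum_{i=1}^{k} c_i\bar{y}_{i.}\right)^2 \Big/ \left(\sum_{i=1}^{k} c_i^2/n_i\right)}{s^2}, \tag{15.13}$$

where $s^2 = \mathrm{SSE}/(N - k)$ with SSE given by (15.8) or (15.10). We refer to the numerator
of (15.13) as the sum of squares for the contrast. If $H_0$ is true, the $F$ statistic in
(15.12) or (15.13) is distributed as $F(1, N - k)$, and we reject $H_0: \delta = 0$ if
$F \geq F_{\alpha, 1, N-k}$ or if $p \leq \alpha$, where $p$ is the $p$ value.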

Two contrasts, say, $\hat{\delta} = \sum_{i=1}^{k} a_i\bar{y}_{i.}$ and $\hat{\gamma} = \sum_{i=1}^{k} b_i\bar{y}_{i.}$, are said to be orthogonal
if $\sum_{i=1}^{k} a_ib_i = 0$. However, in the case of unbalanced data, two orthogonal
contrasts of this type are not independent, as they were in the balanced case
(Theorem 13.6a).


Theorem 15.2. If the $y_{ij}$'s are independently distributed as $N(\mu_i, \sigma^2)$ in the unbalanced
model (15.2), then two contrasts $\hat{\delta} = \sum_{i=1}^{k} a_i\bar{y}_{i.}$ and $\hat{\gamma} = \sum_{i=1}^{k} b_i\bar{y}_{i.}$ are independent
if and only if $\sum_{i=1}^{k} a_ib_i/n_i = 0$.


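PROOF. We express the two contrasts in vector notation as $\hat{\delta} = \mathbf{a}'\bar{\mathbf{y}}$ and $\hat{\gamma} = \mathbf{b}'\bar{\mathbf{y}}$,
where $\bar{\mathbf{y}} = (\bar{y}_{1.}, \bar{y}_{2.}, \ldots, \bar{y}_{k.})'$. By (7.14), we obtain

$$
\operatorname{cov}(\bar{\mathbf{y}}) = \sigma^2(\mathbf{W}'\mathbf{W})^{-1} = \sigma^2\begin{pmatrix} 1/n_1 & 0 & \cdots & 0 \\ 0 & 1/n_2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 1/n_k \end{pmatrix} = \sigma^2\mathbf{D}.
$$

Then by (3.43), we have

$$
\operatorname{cov}(\hat{\delta}, \hat{\gamma}) = \operatorname{cov}(\mathbf{a}'\bar{\mathbf{y}}, \mathbf{b}'\bar{\mathbf{y}}) = \mathbf{a}'\operatorname{cov}(\bar{\mathbf{y}})\mathbf{b} = \sigma^2\mathbf{a}'\mathbf{D}\mathbf{b} = \sigma^2\sum_{i=1}^{k}\frac{a_ib_i}{n_i}. \tag{15.14}
$$

Hence, by Theorem 4.4c, $\hat{\delta}$ and $\hat{\gamma}$ are independent if and only if $\sum_i a_ib_i/n_i = 0$.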


We refer to contrasts whose coefficients satisfy $\sum_i a_ib_i/n_i = 0$ as weighted orthogonal
contrasts. If we define $k - 1$ contrasts of this type, they partition the treatment
sum of squares SSB into $k - 1$ independent sums of squares, each with 1 degree of
freedom. Unweighted orthogonal contrasts that satisfy only $\sum_i a_ib_i = 0$ are not independent
(see Theorem 15.2), and their sums of squares do not add up to the treatment
sum of squares (as they do for balanced data; see Theorem 13.6a).

In practice, weighted orthogonal contrasts are often of less interest than
unweighted orthogonal contrasts because we may not wish to choose the $a_i$'s and
$b_i$'s on the basis of the $n_i$'s in the sample. The $n_i$'s seldom reflect population characteristics
that we wish to take into account. However, it is not necessary that the sums
of squares be independent in order to proceed with the tests. If we use unweighted
orthogonal contrasts with $\sum_i a_ib_i = 0$, the general linear hypothesis test based
on (15.12) or (15.13) tests each contrast adjusted for the other contrasts (see
Theorem 8.4d).


Example 15.2.2a. Suppose that we wish to compare the means of three treatments
and that the coefficients of the orthogonal contrasts $\delta = \mathbf{a}'\boldsymbol{\mu}$ and $\gamma = \mathbf{b}'\boldsymbol{\mu}$ are
given by $\mathbf{a}' = (2, -1, -1)$ and $\mathbf{b}' = (0, 1, -1)$ with corresponding hypotheses

$$H_{01}: \mu_1 = \tfrac{1}{2}(\mu_2 + \mu_3), \qquad H_{02}: \mu_2 = \mu_3.$$

If the sample sizes for the three treatments are, for example, $n_1 = 10$, $n_2 = 20$, and
$n_3 = 5$, then the two estimated contrasts

$$\hat{\delta} = 2\bar{y}_{1.} - \bar{y}_{2.} - \bar{y}_{3.} \quad \text{and} \quad \hat{\gamma} = \bar{y}_{2.} - \bar{y}_{3.}$$

are not independent, and the corresponding sums of squares do not add to the treatment
sum of squares.

The following two vectors provide an example of contrasts whose coefficients
satisfy $\sum_i a_ib_i/n_i = 0$ for $n_1 = 10$, $n_2 = 20$, and $n_3 = 5$:

$$\mathbf{a}' = (25, -20, -5) \quad \text{and} \quad \mathbf{b}' = (0, 1, -1). \tag{15.15}$$

However, $\mathbf{a}'$ leads to the comparison

$$H_{03}: 25\mu_1 = 20\mu_2 + 5\mu_3 \quad \text{or} \quad H_{03}: \mu_1 = \tfrac{4}{5}\mu_2 + \tfrac{1}{5}\mu_3,$$

which is not the same as the hypothesis $H_{01}: \mu_1 = \tfrac{1}{2}(\mu_2 + \mu_3)$ that we were initially
interested in.
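The two orthogonality conditions in this example are easy to check numerically. The sketch below evaluates both $\sum_i a_ib_i$ and $\sum_i a_ib_i/n_i$ for the two pairs of coefficient vectors; the function names are hypothetical, and exact fractions are used so that a zero result is exactly zero.

```python
from fractions import Fraction as Fr

def inner(a, b):
    """Unweighted inner product, sum of a_i * b_i."""
    return sum(ai * bi for ai, bi in zip(a, b))

def weighted_inner(a, b, n):
    """Weighted inner product, sum of a_i * b_i / n_i, as in Theorem 15.2."""
    return sum(Fr(ai * bi, ni) for ai, bi, ni in zip(a, b, n))

n = (10, 20, 5)            # sample sizes from Example 15.2.2a
b = (0, 1, -1)
a_unweighted = (2, -1, -1)
a_weighted = (25, -20, -5)

# Orthogonal in the unweighted sense, but the estimated contrasts are correlated.
print(inner(a_unweighted, b), weighted_inner(a_unweighted, b, n))
# Uncorrelated estimated contrasts, but not orthogonal in the unweighted sense.
print(inner(a_weighted, b), weighted_inner(a_weighted, b, n))
```

By (15.14), the covariance of the estimated contrasts is $\sigma^2$ times the weighted inner product, so only the second pair gives independent estimates under normality.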


Example 15.2.2b. We illustrate both weighted and unweighted contrasts for the fill
data in Table 15.2. Suppose that we wish to make the following comparisons of the
five machines:

A, D versus B, C, E
B, E versus C
A versus D
B versus E

Orthogonal (unweighted) contrast coefficients that provide these comparisons are
given as rows of the following matrix:

$$
\begin{pmatrix}
3 & -2 & -2 & 3 & -2 \\
0 & 1 & -2 & 0 & 1 \\
1 & 0 & 0 & -1 & 0 \\
0 & 1 & 0 & 0 & -1
\end{pmatrix}.
$$

We give the sums of squares for these four contrasts and the $F$ values [see (15.13)] in
Table 15.4.

Since these are unweighted contrasts, the contrast sums of squares do not add up
to the between sum of squares in Table 15.3. None of the $p$ values is less than .05,
so we do not reject $H_0: \sum_i c_i\mu_i = 0$ for any of the four contrasts. In fact, the $p$
values should be less than .05/4 for familywise significance (see the Bonferroni
approach in Section 8.5.2), since the overall test in Table 15.3 did not reject
$H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu_5$.


TABLE 15.4 Sums of Squares and F Values for
Contrasts for the Fill Data in Table 15.2

Contrast               df   Contrast SS   F      p Value
A, D versus B, C, E     1   .005763       0.75   .406
B, E versus C           1   .002352       0.31   .592
A versus D              1   .034405       4.47   .0582
B versus E              1   .013333       1.73   .215


As an example of two weighted orthogonal contrasts that satisfy $\sum_i a_ib_i/n_i = 0$, we
keep the first contrast above and replace the second contrast with $(0, 2, -6, 0, 4)$.
Then, for these two contrasts, we have

$$\sum_i \frac{a_ib_i}{n_i} = \frac{3(0)}{4} - \frac{2(2)}{2} - \frac{2(-6)}{3} + \frac{3(0)}{3} - \frac{2(4)}{4} = 0.$$

The sums of squares and $F$ values [using (15.13)] for the two contrasts are as
follows:

Contrast               df   Contrast SS   F     p Value
A, D versus B, C, E     1   .005763       .75   .406
B, E versus C           1   .005339       .69   .423
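The contrast sums of squares and $F$ values in this example follow from (15.13) and can be sketched as below. The data are transcribed from Table 15.2, with the fourth observations for machines A and E (12.10 and 12.02) taken as an assumption consistent with the group sizes 4, 2, 3, 3, 4 appearing in the weighted-orthogonality sum; the names `contrast_ss` and `contrasts` are hypothetical.

```python
from fractions import Fraction as Fr

# Fill data from Table 15.2; the last A and E values are an assumption
# (inferred from the group sizes and the reported sums of squares).
fill = {
    "A": [11.95, 12.00, 12.25, 12.10],
    "B": [12.18, 12.11],
    "C": [12.16, 12.15, 12.08],
    "D": [12.25, 12.30, 12.10],
    "E": [12.10, 12.04, 12.02, 12.02],
}
labels = ["A", "B", "C", "D", "E"]
n = [len(fill[m]) for m in labels]
ybar = [sum(fill[m]) / len(fill[m]) for m in labels]
N, k = sum(n), len(labels)
# s^2 = SSE / (N - k), with SSE from (15.10)
s2 = sum((y - ybar[i]) ** 2 for i, m in enumerate(labels) for y in fill[m]) / (N - k)

def contrast_ss(c):
    """Numerator of (15.13): (sum c_i ybar_i)^2 / (sum c_i^2 / n_i)."""
    estimate = sum(ci * yi for ci, yi in zip(c, ybar))
    return estimate ** 2 / sum(ci ** 2 / ni for ci, ni in zip(c, n))

contrasts = {
    "A, D versus B, C, E": [3, -2, -2, 3, -2],
    "B, E versus C": [0, 1, -2, 0, 1],
    "A versus D": [1, 0, 0, -1, 0],
    "B versus E": [0, 1, 0, 0, -1],
    "B, E versus C (weighted)": [0, 2, -6, 0, 4],
}
for name, c in contrasts.items():
    ss = contrast_ss(c)
    print(f"{name:26s} SS = {ss:.6f}  F = {ss / s2:.2f}")

# Weighted orthogonality of the first contrast and the replacement second contrast.
first, second = contrasts["A, D versus B, C, E"], contrasts["B, E versus C (weighted)"]
weighted_sum = sum(Fr(a * b, ni) for a, b, ni in zip(first, second, n))
print("sum a_i b_i / n_i =", weighted_sum)
```

Because the last contrast changes only the weights attached to B, C, and E, it addresses a slightly different comparison than the unweighted version, which is the trade-off discussed before Example 15.2.2a.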


15.3. TWO-WAY MODEL 


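The unbalanced two-way model can be expressed as

$$y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \varepsilon_{ijk} \tag{15.16}$$

$$\phantom{y_{ijk}} = \mu_{ij} + \varepsilon_{ijk}, \tag{15.17}$$

$$i = 1, 2, \ldots, a, \quad j = 1, 2, \ldots, b, \quad k = 1, 2, \ldots, n_{ij}.$$

The $\varepsilon_{ijk}$'s are assumed to be independently distributed as $N(0, \sigma^2)$. In this section we
consider the case in which all $n_{ij} > 0$.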

The cell means model for analyzing unbalanced two-way data was first proposed 
by Yates (1934). The cell means model has been advocated by Speed (1969), 
Urquhart et al. (1973), Nelder (1974), Hocking and Speed (1975), Bryce (1975), 
Bryce et al. (1976, 1980b), Searle (1977), Speed et al. (1978), Searle et al. (1981), 
Milliken and Johnson (1984, Chapter 11), and Hocking (1985, 1996). Turner 
(1990) discusses the relationship between (15.16) and (15.17). In our development 
we follow Bryce et al. (1980b) and Hocking (1985, 1996). 


15.3.1 Unconstrained Model 


We first consider the unconstrained model in which the $\mu_{ij}$'s are unrestricted. To
accommodate a no-interaction model, for example, we must place constraints on
the $\mu_{ij}$'s. The constrained model is discussed in Section 15.3.2.

To illustrate the cell means model (15.17), we use $a = 2$ and $b = 3$ with the cell
counts $n_{ij}$ given in Figure 15.2. This example with $N = \sum_{ij} n_{ij} = 11$ will be referred
to throughout the present section and Section 15.3.2.


                       B
            1              2              3
A    1   $n_{11} = 2$   $n_{12} = 1$   $n_{13} = 2$
     2   $n_{21} = 1$   $n_{22} = 3$   $n_{23} = 2$

Figure 15.2 Cell counts for unbalanced data illustration.


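For each of the 11 observations in Figure 15.2, the model $y_{ijk} = \mu_{ij} + \varepsilon_{ijk}$ is

$$
\begin{aligned}
y_{111} &= \mu_{11} + \varepsilon_{111} \\
y_{112} &= \mu_{11} + \varepsilon_{112} \\
y_{121} &= \mu_{12} + \varepsilon_{121} \\
&\;\;\vdots \\
y_{231} &= \mu_{23} + \varepsilon_{231} \\
y_{232} &= \mu_{23} + \varepsilon_{232},
\end{aligned}
$$

or in matrix form

$$\mathbf{y} = \mathbf{W}\boldsymbol{\mu} + \boldsymbol{\varepsilon}, \tag{15.18}$$

where

$$
\mathbf{y} = \begin{pmatrix} y_{111} \\ y_{112} \\ \vdots \\ y_{232} \end{pmatrix}, \quad
\mathbf{W} = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix}, \quad
\boldsymbol{\mu} = \begin{pmatrix} \mu_{11} \\ \mu_{12} \\ \mu_{13} \\ \mu_{21} \\ \mu_{22} \\ \mu_{23} \end{pmatrix}, \quad
\boldsymbol{\varepsilon} = \begin{pmatrix} \varepsilon_{111} \\ \varepsilon_{112} \\ \vdots \\ \varepsilon_{232} \end{pmatrix}.
$$

Each row of $\mathbf{W}$ contains a single 1 that corresponds to the appropriate $\mu_{ij}$ in $\boldsymbol{\mu}$. For
example, the fourth row gives $y_{131} = (0, 0, 1, 0, 0, 0)\boldsymbol{\mu} + \varepsilon_{131} = \mu_{13} + \varepsilon_{131}$. In this illustration,
$\mathbf{y}$ and $\boldsymbol{\varepsilon}$ are $11 \times 1$, and $\mathbf{W}$ is $11 \times 6$. In general, $\mathbf{y}$ and $\boldsymbol{\varepsilon}$ are $N \times 1$, and
$\mathbf{W}$ is $N \times ab$, where $N = \sum_{ij} n_{ij}$.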
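The structure of $\mathbf{W}$ described above can be generated mechanically from the cell counts. The following minimal sketch builds $\mathbf{W}$ for the counts in Figure 15.2 and forms $\mathbf{W}'\mathbf{W}$; the variable names are hypothetical, and plain Python lists are used in place of a matrix library.

```python
# Cell counts n_ij from Figure 15.2, keyed by (i, j).
counts = {(1, 1): 2, (1, 2): 1, (1, 3): 2, (2, 1): 1, (2, 2): 3, (2, 3): 2}
cells = sorted(counts)   # column order: (1,1), (1,2), (1,3), (2,1), (2,2), (2,3)

# One row per observation, with a single 1 in the column of that observation's cell.
W = []
for col, cell in enumerate(cells):
    for _ in range(counts[cell]):
        row = [0] * len(cells)
        row[col] = 1
        W.append(row)

# W'W: entry (i, j) is the inner product of columns i and j of W.
WtW = [[sum(r[i] * r[j] for r in W) for j in range(len(cells))]
       for i in range(len(cells))]

print(len(W), "x", len(cells))          # dimensions of W
print(W[3])                              # fourth row, corresponding to y_131
print([WtW[i][i] for i in range(len(cells))])   # diagonal of W'W
```

The diagonal of $\mathbf{W}'\mathbf{W}$ reproduces the cell counts and the off-diagonal entries vanish, because no observation belongs to two cells; this is what makes the least-squares estimator reduce to the vector of cell means.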


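Since $\mathbf{W}$ is full-rank, we can use the results in Chapters 7 and 8. The analysis is
further simplified because $\mathbf{W}'\mathbf{W} = \operatorname{diag}(n_{11}, n_{12}, n_{13}, n_{21}, n_{22}, n_{23})$. By (7.6), the
least-squares estimator of $\boldsymbol{\mu}$ is given by

$$\hat{\boldsymbol{\mu}} = (\mathbf{W}'\mathbf{W})^{-1}\mathbf{W}'\mathbf{y} = \bar{\mathbf{y}}, \tag{15.19}$$

where $\bar{\mathbf{y}} = (\bar{y}_{11.}, \bar{y}_{12.}, \bar{y}_{13.}, \bar{y}_{21.}, \bar{y}_{22.}, \bar{y}_{23.})'$ contains the sample means of the cells,
$\bar{y}_{ij.} = \sum_{k=1}^{n_{ij}} y_{ijk}/n_{ij}$. By (7.14), the covariance matrix for $\hat{\boldsymbol{\mu}}$ is

$$
\operatorname{cov}(\hat{\boldsymbol{\mu}}) = \sigma^2(\mathbf{W}'\mathbf{W})^{-1} = \sigma^2\operatorname{diag}\left(\frac{1}{n_{11}}, \frac{1}{n_{12}}, \ldots, \frac{1}{n_{23}}\right)
= \operatorname{diag}\left(\frac{\sigma^2}{n_{11}}, \frac{\sigma^2}{n_{12}}, \ldots, \frac{\sigma^2}{n_{23}}\right). \tag{15.20}
$$

For general $a$, $b$, and $N$, an unbiased estimator of $\sigma^2$ [see (7.23)] is given by

$$s^2 = \frac{\mathrm{SSE}}{\nu_E} = \frac{(\mathbf{y} - \mathbf{W}\hat{\boldsymbol{\mu}})'(\mathbf{y} - \mathbf{W}\hat{\boldsymbol{\mu}})}{N - ab}, \tag{15.21}$$

where $\nu_E = \sum_{i=1}^{a}\sum_{j=1}^{b}(n_{ij} - 1) = N - ab$, with $N = \sum_{ij} n_{ij}$. In our illustration with
$a = 2$ and $b = 3$, we have $N - ab = 11 - 6 = 5$. Two alternative forms of SSE are

$$\mathrm{SSE} = \mathbf{y}'[\mathbf{I} - \mathbf{W}(\mathbf{W}'\mathbf{W})^{-1}\mathbf{W}']\mathbf{y} \quad \text{[see (7.26)]}, \tag{15.22}$$

$$\mathrm{SSE} = \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{n_{ij}}(y_{ijk} - \bar{y}_{ij.})^2 \quad \text{[see (14.48)]}. \tag{15.23}$$

Using (15.23), we can express $s^2$ as the pooled estimator

$$s^2 = \frac{\sum_{i=1}^{a}\sum_{j=1}^{b}(n_{ij} - 1)s_{ij}^2}{N - ab}, \tag{15.24}$$

where $s_{ij}^2$ is the variance estimator in the $(ij)$th cell, $s_{ij}^2 = \sum_{k=1}^{n_{ij}}(y_{ijk} - \bar{y}_{ij.})^2/(n_{ij} - 1)$.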
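The equivalence of (15.21) with (15.23) and the pooled form (15.24) can be illustrated with a small numerical sketch. The observations below are hypothetical values invented for the cell counts of Figure 15.2 (they do not come from the text). Cells with a single observation have no within-cell variance estimate and contribute nothing to either form, which the code handles by skipping them in the pooled sum.

```python
# Hypothetical observations (not from the text) matching the cell counts of Figure 15.2.
data = {
    (1, 1): [12.0, 14.0],
    (1, 2): [9.0],
    (1, 3): [15.0, 11.0],
    (2, 1): [8.0],
    (2, 2): [10.0, 13.0, 10.0],
    (2, 3): [7.0, 9.0],
}

N = sum(len(ys) for ys in data.values())
ab = len(data)
cell_mean = {cell: sum(ys) / len(ys) for cell, ys in data.items()}

# SSE from (15.23): squared deviations of each observation from its own cell mean.
sse = sum((y - cell_mean[cell]) ** 2 for cell, ys in data.items() for y in ys)
s2_direct = sse / (N - ab)

def cell_variance(ys):
    """Within-cell variance estimator s_ij^2 (requires at least two observations)."""
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys) / (len(ys) - 1)

# Pooled form (15.24); single-observation cells have weight n_ij - 1 = 0 and are skipped.
s2_pooled = sum((len(ys) - 1) * cell_variance(ys)
                for ys in data.values() if len(ys) > 1) / (N - ab)

print(N - ab, s2_direct, s2_pooled)
```

Both routes give the same value of $s^2$ with $\nu_E = N - ab = 5$ degrees of freedom for these hypothetical data.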

The overparameterized model (15.16) includes parameters representing main
effects and interactions, but the cell means model (15.17) does not have such parameters.
To carry out tests in the cell means model, we use contrasts to express the
main effects and the interaction as functions of the $\mu_{ij}$'s in $\boldsymbol{\mu}$. We begin with the
main effect of A.

In the vector $\boldsymbol{\mu} = (\mu_{11}, \mu_{12}, \mu_{13}, \mu_{21}, \mu_{22}, \mu_{23})'$, the first three elements correspond
to the first level of A and the last three to the second level, as seen in
Figure 15.3. Thus, for the main effect of A, we could compare the average of $\mu_{11}$,
$\mu_{12}$, and $\mu_{13}$ with the average of $\mu_{21}$, $\mu_{22}$, and $\mu_{23}$. The difference between these
averages (sums) can be conveniently expressed by the contrast

$$
\mathbf{a}'\boldsymbol{\mu} = \mu_{11} + \mu_{12} + \mu_{13} - \mu_{21} - \mu_{22} - \mu_{23} = (1, 1, 1, -1, -1, -1)\boldsymbol{\mu}.
$$


                    B
            1            2            3
A    1   $\mu_{11}$   $\mu_{12}$   $\mu_{13}$
     2   $\mu_{21}$   $\mu_{22}$   $\mu_{23}$

Figure 15.3 Cell means corresponding to Figure 15.2.


To compare the two levels of A, we can test the hypothesis $H_0: \mathbf{a}'\boldsymbol{\mu} = 0$, which can
be written as $H_0: (\mu_{11} - \mu_{21}) + (\mu_{12} - \mu_{22}) + (\mu_{13} - \mu_{23}) = 0$. In this form, $H_0$
states that the effect of A averaged (summed) over the levels of B is 0. This corresponds
to a common main effect definition in the presence of interaction; see comments
following (14.62). Note that this test is not useful in model selection. It
simply tests whether the interaction is “symmetric” such that the effect of A, averaged
over the levels of B, is zero.

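Factor B has three levels corresponding to the three columns of Figure 15.3. In a
comparison of three levels, there are 2 degrees of freedom, which will require two contrasts.
Suppose that we wish to compare the first level of B with the other two levels and
then compare the second level of B with the third. To do this, we compare the average
of the means in the first column of Figure 15.3 with the average in the second and third
columns and similarly compare the second and third columns. We can make these comparisons
using $H_0: \mathbf{b}_1'\boldsymbol{\mu} = 0$ and $\mathbf{b}_2'\boldsymbol{\mu} = 0$, where $\mathbf{b}_1'\boldsymbol{\mu}$ and $\mathbf{b}_2'\boldsymbol{\mu}$ are the following two
orthogonal contrasts:

$$
\begin{aligned}
\mathbf{b}_1'\boldsymbol{\mu} &= 2(\mu_{11} + \mu_{21}) - (\mu_{12} + \mu_{22}) - (\mu_{13} + \mu_{23}) \\
&= 2\mu_{11} - \mu_{12} - \mu_{13} + 2\mu_{21} - \mu_{22} - \mu_{23} \\
&= (2, -1, -1, 2, -1, -1)\boldsymbol{\mu},
\end{aligned} \tag{15.25}
$$

$$
\begin{aligned}
\mathbf{b}_2'\boldsymbol{\mu} &= (\mu_{12} + \mu_{22}) - (\mu_{13} + \mu_{23}) \\
&= \mu_{12} - \mu_{13} + \mu_{22} - \mu_{23} \\
&= (0, 1, -1, 0, 1, -1)\boldsymbol{\mu}.
\end{aligned} \tag{15.26}
$$

We can combine $\mathbf{b}_1'$ and $\mathbf{b}_2'$ into the matrix

$$
\mathbf{B} = \begin{pmatrix} \mathbf{b}_1' \\ \mathbf{b}_2' \end{pmatrix} = \begin{pmatrix} 2 & -1 & -1 & 2 & -1 & -1 \\ 0 & 1 & -1 & 0 & 1 & -1 \end{pmatrix}, \tag{15.27}
$$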

and the hypothesis becomes $H_0: \mathbf{B}\boldsymbol{\mu} = \mathbf{0}$, which, by (15.25) and (15.26), is
equivalent to

$$H_0: \mu_{11} + \mu_{21} = \mu_{12} + \mu_{22} = \mu_{13} + \mu_{23} \tag{15.28}$$

(see Problem 15.9). In this form, $H_0$ states that the interaction is symmetric such that
the three levels of B do not differ when averaged over the two levels of A. Note that
other orthogonal or linearly independent contrasts besides those in $\mathbf{b}_1'$ and $\mathbf{b}_2'$ would
lead to (15.28) and to the same $F$ statistic in (15.33) below.

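By analogy to (14.30), the interaction hypothesis can be written as

$$H_0: \mu_{11} - \mu_{21} = \mu_{12} - \mu_{22} = \mu_{13} - \mu_{23},$$

which is a comparison of the “A effects” across the levels of B. If these A effects
differ, we have an interaction. We can express the two equalities in $H_0$ in terms of
orthogonal contrasts similar to those in (15.25) and (15.26):

$$
\begin{aligned}
\mathbf{c}_1'\boldsymbol{\mu} &= 2(\mu_{11} - \mu_{21}) - (\mu_{12} - \mu_{22}) - (\mu_{13} - \mu_{23}) = 0, \\
\mathbf{c}_2'\boldsymbol{\mu} &= (\mu_{12} - \mu_{22}) - (\mu_{13} - \mu_{23}) = 0.
\end{aligned}
$$

Thus $H_0$ can be written as $H_0: \mathbf{C}\boldsymbol{\mu} = \mathbf{0}$, where

$$
\mathbf{C} = \begin{pmatrix} \mathbf{c}_1' \\ \mathbf{c}_2' \end{pmatrix} = \begin{pmatrix} 2 & -1 & -1 & -2 & 1 & 1 \\ 0 & 1 & -1 & 0 & -1 & 1 \end{pmatrix}.
$$

Note that $\mathbf{c}_1$ can be found by taking products of corresponding elements of $\mathbf{a}$ and $\mathbf{b}_1$,
and $\mathbf{c}_2$ can be obtained similarly from $\mathbf{a}$ and $\mathbf{b}_2$, where $\mathbf{a}$, $\mathbf{b}_1$, and $\mathbf{b}_2$ are the coefficient
vectors in $\mathbf{a}'\boldsymbol{\mu}$, $\mathbf{b}_1'\boldsymbol{\mu}$, and $\mathbf{b}_2'\boldsymbol{\mu}$. Thus

$$
\begin{aligned}
\mathbf{c}_1' &= [(1)(2), (1)(-1), (1)(-1), (-1)(2), (-1)(-1), (-1)(-1)] \\
&= (2, -1, -1, -2, 1, 1), \\
\mathbf{c}_2' &= [(1)(0), (1)(1), (1)(-1), (-1)(0), (-1)(1), (-1)(-1)] \\
&= (0, 1, -1, 0, -1, 1).
\end{aligned}
$$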


The elementwise multiplication of these two vectors (the Hadamard product — see 
Section 2.2.4) produces interaction contrasts that are orthogonal to each other and to 
the main effect contrasts. 
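The construction of the interaction contrasts by elementwise products, and the resulting orthogonality, can be confirmed with a short sketch. The helper names `hadamard` and `dot` are hypothetical; the coefficient vectors are those of the $2 \times 3$ illustration.

```python
from itertools import combinations

a = [1, 1, 1, -1, -1, -1]      # main effect of A
b1 = [2, -1, -1, 2, -1, -1]    # B: level 1 versus levels 2 and 3
b2 = [0, 1, -1, 0, 1, -1]      # B: level 2 versus level 3

def hadamard(u, v):
    """Elementwise (Hadamard) product of two coefficient vectors."""
    return [ui * vi for ui, vi in zip(u, v)]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

c1 = hadamard(a, b1)
c2 = hadamard(a, b2)
print(c1, c2)

# Every pair among the main effect and interaction contrasts has zero inner product.
vectors = {"a": a, "b1": b1, "b2": b2, "c1": c1, "c2": c2}
for (name_u, u), (name_v, v) in combinations(vectors.items(), 2):
    print(name_u, name_v, dot(u, v))
```

This orthogonality is in the unweighted sense; with unequal $n_{ij}$ it does not by itself make the corresponding estimated contrasts independent (compare Theorem 15.2 for the one-way case).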

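We now construct tests for the general linear hypotheses $H_0: \mathbf{a}'\boldsymbol{\mu} = 0$, $H_0: \mathbf{B}\boldsymbol{\mu} = \mathbf{0}$,
and $H_0: \mathbf{C}\boldsymbol{\mu} = \mathbf{0}$ for the main effects and interaction. The hypothesis $H_0: \mathbf{a}'\boldsymbol{\mu} = 0$
for the main effect of A is easily tested using an $F$ statistic similar to (8.38) or (15.12):

$$
F = \frac{(\mathbf{a}'\hat{\boldsymbol{\mu}})'[\mathbf{a}'(\mathbf{W}'\mathbf{W})^{-1}\mathbf{a}]^{-1}(\mathbf{a}'\hat{\boldsymbol{\mu}})}{s^2} = \frac{\mathrm{SSA}}{\mathrm{SSE}/\nu_E}, \tag{15.29}
$$

where $s^2$ is given by (15.21) and $\nu_E = N - ab$. [For our illustration, $N - ab = 11 - (2)(3) = 5$.]
If $H_0$ is true, $F$ in (15.29) is distributed as $F(1, N - ab)$.
The $F$ statistic in (15.29) can be written as

$$F = \frac{(\mathbf{a}'\hat{\boldsymbol{\mu}})^2}{s^2\,\mathbf{a}'(\mathbf{W}'\mathbf{W})^{-1}\mathbf{a}} \tag{15.30}$$

$$\phantom{F} = \frac{\left(\sum_{ij} a_{ij}\bar{y}_{ij.}\right)^2}{s^2\sum_{ij} a_{ij}^2/n_{ij}}, \tag{15.31}$$

which is analogous to (15.13). Since $t^2(\nu_E) = F(1, \nu_E)$ (see Problem 5.16), a $t$ statistic
for testing $H_0: \mathbf{a}'\boldsymbol{\mu} = 0$ is given by the square root of (15.30),

$$
t = \frac{\mathbf{a}'\hat{\boldsymbol{\mu}}}{s\sqrt{\mathbf{a}'(\mathbf{W}'\mathbf{W})^{-1}\mathbf{a}}} = \frac{\mathbf{a}'\hat{\boldsymbol{\mu}} - 0}{\sqrt{\widehat{\operatorname{var}}(\mathbf{a}'\hat{\boldsymbol{\mu}})}}, \tag{15.32}
$$

which is distributed as $t(N - ab)$ when $H_0$ is true. Note that the test based on either of
(15.29) or (15.32) is a full-reduced-model test (see Theorem 8.4d) and therefore tests
for factor A “above and beyond” (adjusted for) factor B and the interaction.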

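By Theorem 8.4b, a test statistic for the factor B main effect hypothesis $H_0: \mathbf{B}\boldsymbol{\mu} = \mathbf{0}$
is given by

$$
F = \frac{(\mathbf{B}\hat{\boldsymbol{\mu}})'[\mathbf{B}(\mathbf{W}'\mathbf{W})^{-1}\mathbf{B}']^{-1}\mathbf{B}\hat{\boldsymbol{\mu}}/\nu_B}{\mathrm{SSE}/\nu_E} = \frac{\mathrm{SSB}/\nu_B}{\mathrm{SSE}/\nu_E}, \tag{15.33}
$$

where $\nu_E = N - ab$ and $\nu_B$ is the number of rows of $\mathbf{B}$. (For our illustration, $\nu_E = 5$
and $\nu_B = 2$.) When $H_0$ is true, $F$ in (15.33) is distributed as $F(\nu_B, \nu_E)$.
A test statistic for the interaction hypothesis $H_0: \mathbf{C}\boldsymbol{\mu} = \mathbf{0}$ is obtained similarly:

$$
F = \frac{(\mathbf{C}\hat{\boldsymbol{\mu}})'[\mathbf{C}(\mathbf{W}'\mathbf{W})^{-1}\mathbf{C}']^{-1}\mathbf{C}\hat{\boldsymbol{\mu}}/\nu_{AB}}{\mathrm{SSE}/\nu_E} = \frac{\mathrm{SSAB}/\nu_{AB}}{\mathrm{SSE}/\nu_E}, \tag{15.34}
$$

which is distributed as $F(\nu_{AB}, \nu_E)$, where $\nu_{AB}$, the degrees of freedom for interaction,
is the number of rows of $\mathbf{C}$. (In our illustration, $\nu_{AB} = 2$.)

Because of the unequal $n_{ij}$'s, the three sums of squares SSA, SSB, and SSAB do
not add to the overall sum of squares for treatments and are not statistically independent,
as they are in the balanced case [see (14.40) and Theorem 14.4b]. Each of SSA, SSB,
and SSAB is adjusted for the other effects; that is, the given effect is tested “above
and beyond” the others (see Theorem 8.4d).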
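The three test statistics (15.29), (15.33), and (15.34) share the same quadratic-form structure, which the following sketch exploits. The observations are hypothetical values invented for the cell counts of Figure 15.2 (they are not from the text), and the function names `hypothesis_ss` and `f_tests` are likewise hypothetical. Because $\mathbf{W}'\mathbf{W}$ is diagonal and each hypothesis matrix here has at most two rows, the required inverse is written out explicitly rather than computed with a matrix library.

```python
# Hypothetical observations (not from the text), one list per cell, in the order
# mu_11, mu_12, mu_13, mu_21, mu_22, mu_23 and with the cell counts of Figure 15.2.
data = [[12.0, 14.0], [9.0], [15.0, 11.0], [8.0], [10.0, 13.0, 10.0], [7.0, 9.0]]

a = [[1, 1, 1, -1, -1, -1]]                          # main effect of A
B = [[2, -1, -1, 2, -1, -1], [0, 1, -1, 0, 1, -1]]   # main effect of B, (15.27)
C = [[2, -1, -1, -2, 1, 1], [0, 1, -1, 0, -1, 1]]    # interaction contrasts

def hypothesis_ss(K, ybar, n):
    """(K ybar)' [K (W'W)^{-1} K']^{-1} (K ybar) for a hypothesis matrix K with 1 or 2 rows."""
    d = [sum(k * y for k, y in zip(row, ybar)) for row in K]
    # G = K (W'W)^{-1} K', using W'W = diag(n)
    G = [[sum(r[i] * s[i] / n[i] for i in range(len(n))) for s in K] for r in K]
    if len(K) == 1:
        return d[0] ** 2 / G[0][0]
    det = G[0][0] * G[1][1] - G[0][1] * G[1][0]
    # d' G^{-1} d with the explicit inverse of the symmetric 2x2 matrix G
    return (G[1][1] * d[0] ** 2 - 2 * G[0][1] * d[0] * d[1] + G[0][0] * d[1] ** 2) / det

def f_tests(cells):
    """Return SS, df, and F for A, B, and AB in the unconstrained cell means model."""
    n = [len(c) for c in cells]
    ybar = [sum(c) / len(c) for c in cells]
    sse = sum((y - m) ** 2 for c, m in zip(cells, ybar) for y in c)
    nu_e = sum(n) - len(cells)
    s2 = sse / nu_e
    out = {}
    for name, K in (("A", a), ("B", B), ("AB", C)):
        ss = hypothesis_ss(K, ybar, n)
        out[name] = {"SS": ss, "df": len(K), "F": (ss / len(K)) / s2}
    return out

results = f_tests(data)
for name, r in results.items():
    print(name, round(r["SS"], 4), r["df"], round(r["F"], 4))
```

Each sum of squares produced this way is adjusted for the other effects, so the three values need not add to the overall treatment sum of squares when the cell counts are unequal.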


Example 15.3a. Table 15.5 contains dressing percentages of pigs in a two-way 
classification (Snedecor and Cochran 1967, p. 480). Let factor A be gender and 
factor B be breed. 




TABLE 15.5 Dressing Percentages (Less 70%) of 75 Swine Classified 
by Breed and Gender 


Breed 


Male Female Male Female Male Female Male Female Male Female 


13.3 18.2 10.9 14.3 13.6 12.9 11.6 13.8 10.3 12.8 
12.6 11.3 3.3 15.3 13.1 14.4 13.2 14.4 10.3 8.4 


11.5 14.2 10.5 11.8 4.1 12.6 4.9 10.1 10.6 
15.4 15.9 11.6 11.0 10.8 15.2 6.9 13.9 
12.7 12.9 15.4 10.9 14.7 13.2 10.0 
15.7 15.1 14.4 10.5 12.4 11.0 
13.2 11.6 12.9 12.2 
15.0 14.4 12.5 13.3 
14.3 75 13.0 12.9 
16.5 10.8 7.6 9.9 
15.0 10.5 12.9 
13.7 14.5 
10.9 
13.0 
15.9 
12.8 


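We arrange the elements of the vector $\boldsymbol{\mu}$ to correspond to a row of Table 15.5, that is

$$\boldsymbol{\mu} = (\mu_{11}, \mu_{12}, \mu_{21}, \mu_{22}, \ldots, \mu_{52})',$$

where the first subscript represents breed and the second subscript is associated with
gender.

The vector $\boldsymbol{\mu}$ is $10 \times 1$, the matrix $\mathbf{W}$ is $75 \times 10$, the vector $\mathbf{a}$ is $10 \times 1$, and the
matrices $\mathbf{B}$ and $\mathbf{C}$ are each $4 \times 10$. We show $\mathbf{a}$, $\mathbf{B}$, and $\mathbf{C}$: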


es, ee a ad ae 
— Ph eh ce 0 0 0 0 0 0 
00 0 0 41 +41 -2 -2 : 
O20 Oe 0. T - ae > 0P Sle 
96628) OB ee. ee De De i, 3D 
as I Ss SP, A. OO 2 Oe, OE 
Oe 20h) 200 SO: ea eC 
OO OO. 08 Sh OO Hb 1 




TABLE 15.6 ANOVA for Unconstrained Model

Source        df   Sum of Squares   Mean Square   F       p Value
A (gender)     1        1.984           1.984     0.303   .584
B (breed)      4       90.856          22.714     3.473   .0124
AB             4       24.876           6.219     0.951   .440
Error         65      425.089           6.540
Total         74      552.095


(Note that other sets of orthogonal contrasts could be used in $B$, and the value of $F_B$ below would be the same.) By (15.19), we obtain

\[
\hat{\mu} = \bar{y} = (14.08, 14.60, 11.75, 12.06, 10.40, 13.65, 13.28, 11.03, 11.01, 11.14)'.
\]

By (15.22) or (15.23) we obtain SSE = 425.08895, with $\nu_E = 65$. Using (15.29), (15.33), and (15.34), we obtain

\[
F_A = .30337, \qquad F_B = 3.47318, \qquad F_C = .95095.
\]


The sums of squares leading to these Fs are given in Table 15.6. Note that the sums 
of squares for A, B, AB, and error do not add up to the total sum of squares because 
the data are unbalanced. (These are the type III sums of squares referred to in 
Section 15.1.) 
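
The entries of Table 15.6 can be checked with simple arithmetic. The short sketch below (variable names are ours) recomputes the mean squares and F ratios from the tabled sums of squares and shows the amount by which the four sums of squares fail to add to the total.

```python
# Sums of squares and degrees of freedom as reported in Table 15.6
ss = {"A": 1.984, "B": 90.856, "AB": 24.876}
df = {"A": 1, "B": 4, "AB": 4}
sse, nu_e, ss_total = 425.089, 65, 552.095

mse = sse / nu_e                                   # error mean square
f = {k: (ss[k] / df[k]) / mse for k in ss}         # F ratio for each effect

# Amount by which SSA + SSB + SSAB + SSE falls short of the total
gap = ss_total - (sum(ss.values()) + sse)
```

The nonzero `gap` reflects the unbalanced cell counts; with equal cell sizes the four sums of squares would add to the total.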


15.3.2 Constrained Model 


To allow for additivity or other restrictions, constraints on the $\mu_{ij}$'s must be added to the cell means model (15.17) or (15.18). For example, the model

\[
y_{ijk} = \mu_{ij} + \varepsilon_{ijk}
\]

cannot represent the no-interaction model

\[
y_{ijk} = \mu + \alpha_i + \beta_j + \varepsilon_{ijk} \tag{15.35}
\]

unless we specify some relationships among the $\mu_{ij}$'s.
In our 2 x 3 illustration in Section 15.3.1, the two interaction contrasts are expres- 


sible as 
(2 a ee 
sige. Sea ee Mm ae 


If we wish to use a model without interaction, then $C\mu = 0$ is not a hypothesis to be tested but an assumption to be included in the statement of the model.

In general, for constraints $G\mu = 0$, the model can be expressed as

\[
y = W\mu + \varepsilon \quad \text{subject to } G\mu = 0. \tag{15.36}
\]

We now consider estimation and testing in this constrained model. [For the case $G\mu = h$, where $h \neq 0$, see Bryce et al. (1980b).]

To incorporate the constraints $G\mu = 0$ into $y = W\mu + \varepsilon$, we can use the Lagrange multiplier method (Section 2.14.3). Alternatively, we can reparameterize the model using the matrix

\[
A = \begin{pmatrix} K \\ G \end{pmatrix}, \tag{15.37}
\]


where K specifies parameters of interest in the constrained model. For the no-inter- 
action model (15.35), for example, G would equal C, the first row of K could corre- 
spond to a multiple of the overall mean, and the remaining rows of K could include 
the contrasts for the A and B main effects. Thus, we would have 


Oo 1-1 0 1 -!I 
0 1-1 0 -1 1 


The second row of $K$ is $a'$ and corresponds to the average effect of A. The third and fourth rows are from $B$ and represent the average B effect.

If the rows of $G$ are linearly independent of the rows of $K$, then the matrix $A$ in (15.37) is of full rank and has an inverse. This holds true in our example, in which we have $G = C$. In our example, in fact, the rows of $G$ are orthogonal to the rows of $K$. We can therefore insert $A^{-1}A = I$ into (15.36) to obtain the reparameterized model

\[
\begin{aligned}
y &= WA^{-1}A\mu + \varepsilon \quad \text{subject to } G\mu = 0 \\
  &= Z\delta + \varepsilon \quad \text{subject to } G\mu = 0,
\end{aligned} \tag{15.38}
\]

where $Z = WA^{-1}$ and $\delta = A\mu$.

In the balanced two-way model, we obtained a no-interaction model by simply inserting $\gamma_{ij} = 0$ into $y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \varepsilon_{ijk}$ [see (14.37) and (14.38)]. To analogously incorporate the constraint $G\mu = 0$ directly into the model in the unbalanced case, we partition $\delta$ into

\[
\delta = A\mu = \begin{pmatrix} K \\ G \end{pmatrix}\mu = \begin{pmatrix} K\mu \\ G\mu \end{pmatrix} = \begin{pmatrix} \delta_1 \\ \delta_2 \end{pmatrix}.
\]

With a corresponding partitioning on the columns of $Z$, the model can be written as

\[
y = Z\delta + \varepsilon = (Z_1, Z_2)\begin{pmatrix} \delta_1 \\ \delta_2 \end{pmatrix} + \varepsilon
  = Z_1\delta_1 + Z_2\delta_2 + \varepsilon \quad \text{subject to } G\mu = 0. \tag{15.39}
\]

Since $\delta_2 = G\mu$, the constraint $G\mu = 0$ gives $\delta_2 = 0$ and the constrained model in (15.39) simplifies to

\[
y = Z_1\delta_1 + \varepsilon. \tag{15.40}
\]

An estimator of $\delta_1$ [see (7.6)] is given by

\[
\hat{\delta}_1 = (Z_1'Z_1)^{-1}Z_1'y.
\]


To obtain an expression for $\mu$ subject to the constraints, we multiply

\[
A\mu = \begin{pmatrix} K\mu \\ G\mu \end{pmatrix} = \begin{pmatrix} \delta_1 \\ 0 \end{pmatrix}
\]

by

\[
A^{-1} = (K^*, G^*).
\]

If the rows of $G$ are orthogonal to the rows of $K$, then

\[
(K^*, G^*) = [K'(KK')^{-1},\; G'(GG')^{-1}] \tag{15.41}
\]

(see Problem 15.13). If the rows of $G$ are linearly independent of (but not necessarily orthogonal to) the rows of $K$, we obtain

\[
K^* = H_G K'(K H_G K')^{-1}, \tag{15.42}
\]

where

\[
H_G = I - G'(GG')^{-1}G,
\]

and $G^*$ is similarly defined (see Problem 15.14). In any case, we denote the product of $K^*$ and $\delta_1$ by $\mu_c$:

\[
\mu_c = K^*\delta_1.
\]

We estimate $\mu_c$ by

\[
\hat{\mu}_c = K^*\hat{\delta}_1 = K^*(Z_1'Z_1)^{-1}Z_1'y, \tag{15.43}
\]

which has covariance matrix

\[
\mathrm{cov}(\hat{\mu}_c) = \sigma^2 K^*(Z_1'Z_1)^{-1}K^{*\prime}. \tag{15.44}
\]


To test for factor B in the constrained model, the hypothesis is $H_0\colon B\mu_c = 0$. The covariance matrix of $B\hat{\mu}_c$ is obtained from (3.44) and (15.44) as

\[
\mathrm{cov}(B\hat{\mu}_c) = \sigma^2 BK^*(Z_1'Z_1)^{-1}K^{*\prime}B'.
\]

Then, by Theorem 8.4b, the test statistic for $H_0\colon B\mu_c = 0$ in the constrained model becomes

\[
F = \frac{(B\hat{\mu}_c)'[BK^*(Z_1'Z_1)^{-1}K^{*\prime}B']^{-1}B\hat{\mu}_c/\nu_B}{\mathrm{SSE}_c/\nu_{E_c}}, \tag{15.45}
\]

where $\mathrm{SSE}_c$ (subject to $G\mu = 0$) is obtained using $\hat{\mu}_c$ [from (15.43)] in (15.21). (In our example, where $G = C$ for interaction, $\mathrm{SSE}_c$ effectively pools SSE and SSAB from the unconstrained model.) The degrees of freedom $\nu_{E_c}$ is obtained as $\nu_{E_c} = \nu_E + \mathrm{rank}(G)$, where $\nu_E = N - ab$ is for the unconstrained model, as defined following (15.21). [In our example, $\mathrm{rank}(G) = 2$ since there are 2 degrees of freedom for SSAB.] We reject $H_0\colon B\mu_c = 0$ if $F > F_{\alpha,\nu_B,\nu_{E_c}}$, where $F_\alpha$ is the upper $\alpha$ percentage point of the central $F$ distribution.

For $H_0\colon a'\mu_c = 0$, the $F$ statistic becomes

\[
F = \frac{(a'\hat{\mu}_c)'[a'K^*(Z_1'Z_1)^{-1}K^{*\prime}a]^{-1}(a'\hat{\mu}_c)}{\mathrm{SSE}_c/\nu_{E_c}}, \tag{15.46}
\]

which is distributed as $F(1, \nu_{E_c})$ if $H_0$ is true.
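
The reparameterization in (15.37)-(15.46) translates directly into matrix code. The following NumPy sketch is illustrative only: the function names, the simulated data, and the particular rows chosen for `K` and `G` in a 2 x 3 layout are assumptions made for the example, and the degrees-of-freedom line assumes that `G` has full row rank.

```python
import numpy as np

def constrained_fit(y, W, K, G):
    """Fit y = W mu + e subject to G mu = 0 via A = (K; G), as in (15.37)-(15.44)."""
    A_inv = np.linalg.inv(np.vstack([K, G]))
    K_star = A_inv[:, :K.shape[0]]                 # first block of A^{-1} = (K*, G*)
    Z1 = W @ K_star
    ZtZ_inv = np.linalg.inv(Z1.T @ Z1)
    delta1_hat = ZtZ_inv @ Z1.T @ y
    mu_c = K_star @ delta1_hat                     # constrained cell means (15.43)
    resid = y - W @ mu_c
    sse_c = resid @ resid
    nu_ec = W.shape[0] - W.shape[1] + G.shape[0]   # nu_E + rank(G), G full row rank
    cov_unscaled = K_star @ ZtZ_inv @ K_star.T     # cov(mu_c) / sigma^2 (15.44)
    return mu_c, sse_c, nu_ec, cov_unscaled, K_star

def f_stat(L, mu_c, cov_unscaled, sse_c, nu_ec):
    """F statistic for H0: L mu_c = 0, following the pattern of (15.45) and (15.46)."""
    l_mu = L @ mu_c
    ss = l_mu @ np.linalg.solve(L @ cov_unscaled @ L.T, l_mu)
    return (ss / L.shape[0]) / (sse_c / nu_ec)

# Hypothetical unbalanced 2 x 3 data (11 observations)
W = np.repeat(np.eye(6), [2, 1, 2, 2, 2, 2], axis=0)
rng = np.random.default_rng(0)
y = W @ np.array([10.0, 12.0, 11.0, 9.0, 13.0, 12.0]) + rng.normal(size=11)

a = np.array([[1.0, 1.0, 1.0, -1.0, -1.0, -1.0]])
B = np.array([[2.0, -1.0, -1.0, 2.0, -1.0, -1.0], [0.0, 1.0, -1.0, 0.0, 1.0, -1.0]])
K = np.vstack([np.ones((1, 6)), a, B])             # overall mean, A, and B rows
G = np.array([[2.0, -1.0, -1.0, -2.0, 1.0, 1.0], [0.0, 1.0, -1.0, 0.0, -1.0, 1.0]])

mu_c, sse_c, nu_ec, cov_u, K_star = constrained_fit(y, W, K, G)
F_B = f_stat(B, mu_c, cov_u, sse_c, nu_ec)
F_A = f_stat(a, mu_c, cov_u, sse_c, nu_ec)
```

Because the rows of `G` are orthogonal to the rows of `K` in this example, `K_star` agrees with the closed form $K'(KK')^{-1}$ in (15.41), and the fitted means satisfy the constraint $G\hat{\mu}_c = 0$ up to rounding error.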


Example 15.3b. For the pigs data in Table 15.5, we test for factors A and B in a no- 
interaction model, where factor A is gender and factor B is breed. The matrix G is the 
same as C in Example 15.3a. For K we have 


i a a Pe Ae Wis ie ff 
5 hy? Re Sa? ee esha 
Rete ee (2 oe D120 =D) ee OD 
te i, SS OO Os, Oe SW -*6 
Gi Oe Ce AE ee eS 

O20. 0808: in 2) Oe St 


By (15.43), we obtain

\[
\hat{\mu}_c = (14.16, 14.42, 11.77, 12.03, 11.40, 11.65, 12.45, 12.70, 10.97, 11.22)'.
\]

TABLE 15.7 ANOVA for Constrained Model

Source        df   Sum of Squares   Mean Square   F      p Value
A (gender)     1        1.132           1.132     0.17   .678
B (breed)      4      101.418          25.355     3.89   .00660
Error         69      449.965           6.521
Total         74      552.0955

For $\mathrm{SSE}_c$, we use $\hat{\mu}_c$ in place of $\hat{\mu}$ in (15.21) to obtain $\mathrm{SSE}_c = 449.96508$. For $\nu_{E_c}$, we have

\[
\nu_{E_c} = \nu_E + \mathrm{rank}(G) = 65 + 4 = 69.
\]

Then by (15.45), we obtain $F_{B_c} = 3.8880003$. The sums of squares leading to $F_{B_c}$ and $F_{A_c}$ are given in Table 15.7.
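
A quick arithmetic check, using only the values printed in Tables 15.6 and 15.7, confirms that the constrained error sum of squares pools SSE and SSAB from the unconstrained analysis and reproduces the tabled F ratios (variable names are ours).

```python
# Values reported in Tables 15.6 and 15.7
sse_u, ss_ab, nu_e, df_ab = 425.089, 24.876, 65, 4
sse_c, nu_ec = 449.965, 69
ss_a_c, ss_b_c = 1.132, 101.418

pooled_sse = sse_u + ss_ab          # SSE + SSAB from the unconstrained model
pooled_df = nu_e + df_ab            # nu_E + rank(G)

mse_c = sse_c / nu_ec
f_a_c = (ss_a_c / 1) / mse_c
f_b_c = (ss_b_c / 4) / mse_c
```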


15.4 TWO-WAY MODEL WITH EMPTY CELLS 


Possibly the greatest advantage of the cell means model in the analysis of unbalanced 
data is that extreme situations such as empty cells can be dealt with relatively easily. 
The cell means approach allows one to deal specifically with nonestimability pro- 
blems arising from the empty cells (as contrasted with nonestimability arising from 
overparameterization of the model). Much of our discussion here follows that of 
Bryce et al. (1980a). 

Consider the unbalanced two-way model in (15.17), but allow $n_{ij}$ to be equal to 0 for one or more (say $m$) isolated cells; that is, the empty cells do not constitute a whole row or whole column. Assume also that the empty cells are missing at random (Little and Rubin 2002, p. 12); that is, the emptiness of the cells is independent of the values that would be observed in those cells.

In the empty cells model, $W$ is non-full-rank in that it has $m$ columns equal to 0. To simplify notation, assume that the columns of $W$ have been rearranged with the columns of 0 occurring last. Hence

\[
W = (W_1, O),
\]

where $W_1$ is an $n \times (ab - m)$ matrix and $O$ is $n \times m$. Correspondingly,

\[
\mu = \begin{pmatrix} \mu_o \\ \mu_e \end{pmatrix},
\]

where $\mu_o$ is the vector of cell means for the occupied cells while $\mu_e$ is the vector of cell means for the empty cells. The model is thus the non-full-rank model

\[
y = (W_1, O)\begin{pmatrix} \mu_o \\ \mu_e \end{pmatrix} + \varepsilon. \tag{15.47}
\]




The first task in the analysis of two-way data with empty cells is to test for the interaction between the factors A and B. To test for the interaction when there are isolated empty cells, care must be exercised to ensure that a testable hypothesis is being tested (Section 12.6). The full-reduced-model approach [see (8.31)] is useful here. A sensible full model is the unconstrained cell means model in (15.47). Even though $W$ is not full-rank,

\[
\mathrm{SSE}_u = y'[I - W(W'W)^{-}W']y \tag{15.48}
\]

is invariant to the choice of a generalized inverse (Theorem 12.3e). The reduced model is the additive model, given by

\[
y = WA^{-1}A\mu + \varepsilon \quad \text{subject to } G\mu = 0,
\]

where

\[
A = \begin{pmatrix} K \\ G \end{pmatrix},
\]

in which $K$ is a matrix specifying the overall mean and linearly independent main effect contrasts for factors A and B, and the rows of $G$ are linearly independent interaction contrasts (see Section 15.3.2) such that $A$ is full-rank. We define $Z_1$ as $WK^*$ [see (15.41)]. Because the empty cells are isolated, $Z_1$ is full-rank even though some of the constraints in $G\mu = 0$ are nonestimable. The error sum of squares for the additive model is then

\[
\mathrm{SSE}_a = y'[I - Z_1(Z_1'Z_1)^{-1}Z_1']y, \tag{15.49}
\]

and the test statistic for the interaction is

\[
F = \frac{(\mathrm{SSE}_a - \mathrm{SSE}_u)/[(a - 1)(b - 1) - m]}{\mathrm{SSE}_u/(n - ab + m)}. \tag{15.50}
\]


Equivalently, the interaction could be tested by the general linear hypothesis approach in (8.27). However, a maximal set of nonestimable interaction side conditions involving $\mu_e$ must first be imposed on the model. For example, the side conditions could be specified as

\[
T\mu = 0, \tag{15.51}
\]

where $T$ is an $m \times ab$ matrix with rows corresponding to the contrasts $\mu_{ij} - \bar{\mu}_{i.} - \bar{\mu}_{.j} + \bar{\mu}_{..}$ for all $m$ empty cells (Henderson and McAllister 1978). Using (12.37), we obtain

\[
\hat{\mu} = (W'W + T'T)^{-1}W'y \tag{15.52}
\]

and

\[
\mathrm{cov}(\hat{\mu}) = \sigma^2 (W'W + T'T)^{-1}W'W(W'W + T'T)^{-1}. \tag{15.53}
\]
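
The estimator in (15.52) can be illustrated numerically. The sketch below uses a hypothetical 3 x 2 layout with one empty cell and simulated responses; the layout, seed, and variable names are assumptions for illustration only. It builds the side-condition row for the empty cell and evaluates (15.52) and the unscaled form of (15.53).

```python
import numpy as np

a_levels, b_levels = 3, 2
# Cell order (1,1),(1,2),(2,1),(2,2),(3,1),(3,2); the last cell is empty
counts = np.array([2, 3, 2, 2, 3, 0])
W = np.repeat(np.eye(a_levels * b_levels), counts, axis=0)   # last column is all zeros
rng = np.random.default_rng(1)
y = rng.normal(loc=10.0, size=W.shape[0])

# Coefficients of mu_ij - mubar_i. - mubar_.j + mubar_.. for the empty cell (3,2)
i0, j0 = 2, 1
t = np.zeros((a_levels, b_levels))
for i in range(a_levels):
    for j in range(b_levels):
        t[i, j] = ((i == i0) * (j == j0) - (i == i0) / b_levels
                   - (j == j0) / a_levels + 1 / (a_levels * b_levels))
T = t.reshape(1, -1)

M_inv = np.linalg.inv(W.T @ W + T.T @ T)
mu_hat = M_inv @ W.T @ y                      # (15.52)
cov_unscaled = M_inv @ (W.T @ W) @ M_inv      # cov(mu_hat) / sigma^2, as in (15.53)
```

In this sketch the estimates for the occupied cells coincide with the observed cell means, while the estimate for the empty cell is determined entirely by the side condition $T\hat{\mu} = 0$.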


The interaction can then be tested using the general linear hypothesis test of $H_0\colon C\mu = 0$, where $C$ is the full matrix of $(a - 1)(b - 1)$ interaction contrasts. Even though some of the rows of $C\mu$ are not estimable, the test statistic can be computed using a generalized inverse in the numerator as

\[
F = \frac{(C\hat{\mu})'\{C[\mathrm{cov}(\hat{\mu})/\sigma^2]C'\}^{-}(C\hat{\mu})/[(a - 1)(b - 1) - m]}{\mathrm{SSE}/(n - ab + m)}. \tag{15.54}
\]

The error sum of squares for this model, SSE, turns out to be the same as $\mathrm{SSE}_u$ in (15.48). By Theorem 2.8c(v), the numerator of this $F$ statistic is invariant to the choice of a generalized inverse (Problem 15.16).

Both versions of this additivity test involve the unverifiable assumption that 
the means of the empty cells follow the additive pattern displayed by the means of 
the occupied cells. If there are relatively few empty cells, this is usually a reasonable 
assumption. 
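
The full-reduced version of the additivity test can be sketched as follows. The 3 x 2 layout with one empty cell, the simulated responses, the helper `sse`, and the particular rows of `K` and `G` are all hypothetical choices for illustration; a Moore-Penrose inverse stands in for the generalized inverse in (15.48).

```python
import numpy as np

a_levels, b_levels, m = 3, 2, 1
counts = np.array([2, 3, 2, 2, 3, 0])             # cell (3,2) is empty
W = np.repeat(np.eye(6), counts, axis=0)
n, ab = W.shape[0], a_levels * b_levels
rng = np.random.default_rng(2)
y = rng.normal(loc=10.0, size=n)

# Overall mean and main effect contrasts (K); interaction contrasts (G)
K = np.array([[1, 1, 1, 1, 1, 1],
              [1, 1, -1, -1, 0, 0],
              [1, 1, 1, 1, -2, -2],
              [1, -1, 1, -1, 1, -1]], dtype=float)
G = np.array([[1, -1, -1, 1, 0, 0],
              [1, -1, 1, -1, -2, 2]], dtype=float)

def sse(y, X):
    """Residual sum of squares y'[I - X (X'X)^- X']y using a Moore-Penrose inverse."""
    P = X @ np.linalg.pinv(X.T @ X) @ X.T
    r = y - P @ y
    return r @ r

K_star = np.linalg.inv(np.vstack([K, G]))[:, :K.shape[0]]
Z1 = W @ K_star                                    # design matrix of the additive model

sse_u = sse(y, W)                                  # full model (15.48)
sse_a = sse(y, Z1)                                 # additive model (15.49)
df_num = (a_levels - 1) * (b_levels - 1) - m
df_den = n - ab + m
F = ((sse_a - sse_u) / df_num) / (sse_u / df_den)  # (15.50)
```

With a single isolated empty cell, `Z1` retains full column rank, and one of the two interaction degrees of freedom is lost, leaving `df_num = 1`.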

If the interaction is not significant and is deemed to be negligible, the additive 
model can be used as in Section 15.3.2 without any modifications. The isolated 
empty cells present no problems for the use of the additive model. 

If the interaction is significant, it may be possible to partially constrain the interaction in an attempt to render all cell means (including those in $\mu_e$) estimable. This is not always possible, because it requires a set of constraints that are both a priori reasonable and such that they render $\mu$ estimable. Nonetheless, it is often advisable to make this attempt because no new theoretical results are needed. The greatest challenges are practical, in that sensible constraints must be used. Many constraints will do the job mathematically, but the results are meaningless unless the constraints are reasonable. Unlike many other methods associated with linear models, the validity of this procedure depends on the parameterization of the model and the specific constraints that are chosen.

We proceed in this attempt by proposing partial interaction constraints

\[
G\mu = 0
\]

for the empty cells model in (15.47). We choose $K$ such that its rows are linearly independent of the rows of $G$ so that

\[
A = \begin{pmatrix} K \\ G \end{pmatrix}
\]

is nonsingular. Thus $A^{-1} = (K^*\; G^*)$ as in the comments following (15.41). Suppose that the constraints are realistic, and that they are such that the constrained model is not the additive model; that is, at least a portion of the interaction is unconstrained. Then, if $Z_1 = WK^*$ is full-rank, all the cell means (including $\mu_e$) can be estimated as

\[
\hat{\mu} = K^*(Z_1'Z_1)^{-1}Z_1'y, \tag{15.55}
\]

and $\mathrm{cov}(\hat{\mu})$ is given by (15.44). Further inferences about linear combinations of the cell means can then be readily carried out. If $Z_1$ is not full-rank, care must be exercised to ensure that only estimable functions of $\mu$ are estimated and that testable hypotheses involving $\mu$ are tested (see Section 12.2).

A simple way to quickly check whether $Z_1$ is full-rank (and thus all cell means are estimable) is given in the following theorem.

A simple way to quickly check whether Z, is full-rank (and thus all cell means are 
estimable) is given in the following theorem. 


Theorem 15.4. Consider the constrained empty cells model in (15.47) with $m$ empty cells. Partition $A$ as

\[
A = \begin{pmatrix} K \\ G \end{pmatrix} = \begin{pmatrix} K_1 & K_2 \\ G_1 & G_2 \end{pmatrix}
\]

conformal with the partitioned vector

\[
\mu = \begin{pmatrix} \mu_o \\ \mu_e \end{pmatrix}.
\]

The elements of $\mu$ are estimable (equivalently, $Z_1$ is full-rank) if and only if $\mathrm{rank}(G_2) = m$.

PROOF. We prove this theorem for the special case in which $G$ has $m$ rows so that $G_2$ is $m \times m$. We partition $A^{-1}$ as

\[
A^{-1} = \begin{pmatrix} K_1^* & G_1^* \\ K_2^* & G_2^* \end{pmatrix},
\]

with submatrices conforming to the partitioning of $A$. Then

\[
Z_1 = (W_1, O)\begin{pmatrix} K_1^* \\ K_2^* \end{pmatrix} = W_1K_1^*.
\]

Since $W_1$ is full-rank and each of its rows consists of one 1 with several 0s, $W_1K_1^*$ contains one or more copies of all of the rows of $K_1^*$. Thus $\mathrm{rank}(Z_1) = \mathrm{rank}(K_1^*)$. Since $A^{-1}$ is nonsingular, $K_1^{*-1}$ exists if $K_1^*$ is full rank. If so, the product

\[
\begin{pmatrix} I & O \\ -K_2^*K_1^{*-1} & I \end{pmatrix}
\begin{pmatrix} K_1^* & G_1^* \\ K_2^* & G_2^* \end{pmatrix}
= \begin{pmatrix} K_1^* & G_1^* \\ O & G_2^* - K_2^*K_1^{*-1}G_1^* \end{pmatrix}
\]

is defined and is nonsingular by Theorem 2.4(ii). By Corollary 1 to Theorem 2.9b, $G_2^* - K_2^*K_1^{*-1}G_1^*$ is also nonsingular. But by equation (2.50), $(G_2^* - K_2^*K_1^{*-1}G_1^*)^{-1} = G_2$. Thus, if $A^{-1}$ is nonsingular, nonsingularity of $K_1^*$ implies nonsingularity of $G_2$. Analogous reasoning leads to the converse. Thus $K_1^*$ is full-rank if and only if $G_2$ is full-rank. Furthermore, $Z_1$ is full-rank if and only if $\mathrm{rank}(G_2) = m$.
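
The rank condition of Theorem 15.4 is easy to check numerically. The sketch below uses a hypothetical 3 x 2 layout whose sixth cell is empty; the helper name and the contrast rows are our own illustrative choices. Following the proof, it uses the rank of $K_1^*$ in place of the rank of $Z_1$, which assumes that every occupied cell has at least one observation.

```python
import numpy as np

def all_means_estimable(K, G, empty):
    """Return (Z1 is full rank, rank(G2) == m) for empty-cell column indices `empty`."""
    ab = K.shape[1]
    occupied = [c for c in range(ab) if c not in empty]
    K_star = np.linalg.inv(np.vstack([K, G]))[:, :K.shape[0]]
    K1_star = K_star[occupied, :]                  # rows of K* for the occupied cells
    z1_full_rank = bool(np.linalg.matrix_rank(K1_star) == K.shape[0])
    g2_rank_ok = bool(np.linalg.matrix_rank(G[:, empty]) == len(empty))
    return z1_full_rank, g2_rank_ok

# Cell order (1,1),(1,2),(2,1),(2,2),(3,1),(3,2); cell index 5 is empty
base = np.array([[1, 1, 1, 1, 1, 1],
                 [1, 1, -1, -1, 0, 0],
                 [1, 1, 1, 1, -2, -2],
                 [1, -1, 1, -1, 1, -1]], dtype=float)
g1 = np.array([[1, -1, -1, 1, 0, 0]], dtype=float)    # does not involve the empty cell
g2 = np.array([[1, -1, 1, -1, -2, 2]], dtype=float)   # involves the empty cell

case_involves_empty = all_means_estimable(np.vstack([base, g1]), g2, [5])
case_misses_empty = all_means_estimable(np.vstack([base, g2]), g1, [5])
case_additive = all_means_estimable(base, np.vstack([g1, g2]), [5])
```

Constraining the interaction contrast that involves the empty cell renders all cell means estimable, whereas constraining only the contrast among occupied cells does not; in each case the two conditions of the theorem agree.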


Example 15.4a. For the second-language data of Table 15.8, we test for the interaction of native language and gender. There are two empty cells, and thus $W$ is a 281 x 16 matrix with two columns of 0. For the unconstrained model we use (15.48) to obtain

\[
\mathrm{SSE}_u = 113.235.
\]

TABLE 15.8 Comfort in Using English as a Second Language for Students at BYU-Hawaii*

Native Language            Statistic   Male     Female
Samoan                     n           24       28
                           Mean        3.20     3.38
                           SD          0.66     0.68
Tongan                     n           25       39
                           Mean        3.03     3.10
                           SD          0.69     0.61
Hawaiian                   n           4        2
                           Mean        3.47     3.13
                           SD          0.68     0.47
Fijian                     n           1        --
                           Mean        3.79     --
                           SD          --       --
Pacific Islands English    n           26       49
                           Mean        3.71     3.13
                           SD          0.58     0.73
Maori                      n           3        1
                           Mean        4.07     3.04
                           SD          0.061    --
Mandarin                   n           15       43
                           Mean        3.33     3.14
                           SD          0.74     0.61
Cantonese                  n           --       21
                           Mean        --       3.00
                           SD          --       0.54

* Brigham Young University-Hawaii; data classified by gender and native language. Key to table entries: number of observations, mean, and standard deviation.

Numbering the cells of Table 15.8 from 1 to 8 for the first column and from 9 to 16 for the second column, we now define

\[
A = \begin{pmatrix} K \\ G \end{pmatrix}, \tag{15.56}
\]


15.4 TWO-WAY MODEL WITH EMPTY CELLS 437 


where 
1 1 21-1 ~=21 +3 TF FT FT tT 1 1 1 1 1 1 
1 21 21 =é21~=2~ +3 «TF Jt -1 -1 -1 -1 -1 -1 -1 -!1 
1-1 0 0 0 0 0 0 1-1 0 0 0 0 0 0 
0 0 1 0-1 0 0 0 0 0 1 0-1 0 0 +0 
K=!|0 0 0 1-1 0 0 0 0 0 0 1-1 1 =O O 
0 0 0 0-1 1 0 0 0 0 0 0-1 -1 O 0 
2 2 -1 -1 -1 -1 0 0 2 2 -1 -1 -1 -1 O O 
0 0 0 0 0 0 1-1 0 0 0 0 0 0-1 #1 
1 1 21 1 1 =1-3-3 +1 «21 ~=+1~ «1 ~21 ~21 -3 -3 
and 
1-!l 0 0 0 0 0 0 1-1 0 0 0 0 00 
0 O 1 0-1 0 0 0 0 0-1 0 1 0 00 
0 0 0 1-1 0 0 0 0 0 0-1 1 0 00 
G=|0 0 0 0-1 1 0 0 0 0 0 0 1-1 00 
2 2 -1 -1 -1 -1 0O O-2-2 1 1 1 1 00 
0 0 0 0 0 0 1-1 0 0 0 0 0 0-1 1 
1 1 1 21 +1 ~=1 -3 -3 1 -1 -1 -1 -1 3 3 


The overall mean and main effect contrasts are specified by $K$, while interaction contrasts are specified by $G$. Using (15.49), $\mathrm{SSE}_a = 119.213$. The full-reduced $F$ test for additivity (15.50), which has $\mathrm{SSE}_u$ in its denominator, yields the test statistic

\[
F = \frac{(119.213 - 113.235)/5}{113.235/267} = 2.82,
\]

which is larger than the critical value of $F_{.05,5,267} = 2.25$.
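
The degrees of freedom and the value of the statistic can be verified with a few lines of arithmetic (variable names are ours). The reported value 2.82 is reproduced when the denominator uses $\mathrm{SSE}_u = 113.235$, as (15.50) prescribes.

```python
sse_a, sse_u = 119.213, 113.235      # additive and unconstrained error sums of squares
a_levels, b_levels = 8, 2            # native languages and genders
m, n = 2, 281                        # empty cells and total observations

df_num = (a_levels - 1) * (b_levels - 1) - m     # interaction df lost to empty cells
df_den = n - a_levels * b_levels + m
F = ((sse_a - sse_u) / df_num) / (sse_u / df_den)
```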
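As an alternative approach to testing additivity, we impose the nonestimable side conditions $\mu_{81} - \bar{\mu}_{8.} - \bar{\mu}_{.1} + \bar{\mu}_{..} = 0$ and $\mu_{42} - \bar{\mu}_{4.} - \bar{\mu}_{.2} + \bar{\mu}_{..} = 0$ on the model by setting the rows of $T$ in (15.51) to the coefficient vectors of these two contrasts and using

\[
C = \left(\begin{array}{rrrrrrrrrrrrrrrr}
1 & -1 & 0 & 0 & 0 & 0 & 0 & 0 & -1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & -1 & 0 & 0 & 0 & 0 & 0 & -1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & -1 & 0 & 0 & 0 & 0 & -1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & -1 & 0 & 0 & 0 & -1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & -1 & 0 & 0 & -1 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 & -1 & 0 & -1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 & -1 & -1 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{array}\right)
\]

in (15.54). The $F$ statistic for the general linear hypothesis test of additivity (15.54) is again equal to 2.82.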

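Since the interaction is significant for this dataset, we partially constrain the interaction with contextually sensible estimable constraints in an effort to make all of the cell means estimable. We use $A$ as defined in (15.56), but repartition it so that

\[
K = \left(\begin{array}{rrrrrrrrrrrrrrrr}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & -1 & -1 & -1 & -1 & -1 & -1 & -1 & -1 \\
1 & -1 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & -1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & -1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & -1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & -1 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & -1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & -1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & -1 & 1 & 0 & 0 \\
2 & 2 & -1 & -1 & -1 & -1 & 0 & 0 & 2 & 2 & -1 & -1 & -1 & -1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & -1 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & -1 \\
1 & 1 & 1 & 1 & 1 & 1 & -3 & -3 & 1 & 1 & 1 & 1 & 1 & 1 & -3 & -3 \\
0 & 0 & 1 & 0 & -1 & 0 & 0 & 0 & 0 & 0 & -1 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & -1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & -1 & 0 & 0 \\
2 & 2 & -1 & -1 & -1 & -1 & 0 & 0 & -2 & -2 & 1 & 1 & 1 & 1 & 0 & 0 \\
1 & 1 & 1 & 1 & 1 & 1 & -3 & -3 & -1 & -1 & -1 & -1 & -1 & -1 & 3 & 3
\end{array}\right)
\]

and

\[
G = \left(\begin{array}{rrrrrrrrrrrrrrrr}
1 & -1 & 0 & 0 & 0 & 0 & 0 & 0 & -1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & -1 & 0 & 0 & 0 & 0 & 0 & 0 & -1 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & -1 & 0 & 0 & 0 & 0 & 0 & 0 & -1 & 1
\end{array}\right).
\]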


The partial interaction constraints specified by $G\mu = 0$ seem sensible in that they specify that the male-female difference is the same for Samoan and Tongan speakers, for Fijian and Hawaiian speakers, and for Mandarin and Cantonese speakers. Because the empty cells are the eighth and twelfth cells, we have

\[
G_2 = \begin{pmatrix} 0 & 0 \\ 0 & -1 \\ -1 & 0 \end{pmatrix},
\]

which obviously has rank = 2. Thus, by Theorem 15.4, all the cell means are estimable. Using (15.55) to compute the constrained estimates and (15.44) to compute their standard errors, we obtain the results in Table 15.9.


TABLE 15.9 Estimated Mean Comfort in Using English as Second Language (with Standard Error) for Students at BYU-Hawaii*

Native Language            Male          Female
Samoan                     3.23 (.11)    3.35 (.11)
Tongan                     3.00 (.11)    3.12 (.09)
Hawaiian                   3.47 (.33)    3.13 (.46)
Fijian                     3.79 (.65)    3.20 (.67)
Pacific Islands English    3.71 (.03)    3.13 (.09)
Maori                      4.07 (.38)    3.04 (.65)
Mandarin                   3.33 (.17)    3.14 (.10)
Cantonese                  3.19 (.24)    3.00 (.14)

* On the basis of a constrained empty-cells model.


PROBLEMS 


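15.1 For the model $y = W\mu + \varepsilon$ in (15.2), find $W'W$ and $W'y$ and show that $(W'W)^{-1}W'y = \bar{y}$ as in (15.5).

15.2 (a) Show that for the reduced model $y_{ij} = \mu + \varepsilon_{ij}$ in Section 15.3, $\mathrm{SS}(\mu) = N\bar{y}_{..}^2$ as used in (15.6).
     (b) Show that $\mathrm{SSB} = \sum_{i=1}^{k} \bar{y}_{i.}y_{i.} - N\bar{y}_{..}^2$ as in (15.6).
     (c) Show that (15.6) is equal to (15.7), that is, $\mathrm{SSB} = \sum_{i=1}^{k} \bar{y}_{i.}y_{i.} - N\bar{y}_{..}^2 = \sum_{i=1}^{k} y_{i.}^2/n_i - y_{..}^2/N$.

15.3 (a) Show that SSB in (15.9) is equal to SSB in (15.7), that is, $\sum_{i=1}^{k} n_i(\bar{y}_{i.} - \bar{y}_{..})^2 = \sum_{i=1}^{k} y_{i.}^2/n_i - y_{..}^2/N$.
     (b) Show that SSE in (15.10) is equal to SSE in (15.8), that is, $\sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i.})^2 = \sum_{i=1}^{k}\sum_{j=1}^{n_i} y_{ij}^2 - \sum_{i=1}^{k} y_{i.}^2/n_i$.

15.4 Show that $F = (\sum_i c_i\bar{y}_{i.})^2/(s^2 \sum_i c_i^2/n_i)$ in (15.13) follows from (15.12).

15.5 Show that $a'$ and $b'$ in (15.15) provide contrast coefficients that satisfy the property $\sum_i a_ib_i/n_i = 0$.

15.6 Show that $\hat{\mu} = \bar{y}$ as in (15.19).

15.7 Obtain (15.23) from (15.21); that is, show that $(y - W\hat{\mu})'(y - W\hat{\mu}) = \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{n_{ij}} (y_{ijk} - \bar{y}_{ij.})^2$.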


15.8 Obtain (15.24) from (15.23); that is, show that $\sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{n_{ij}} (y_{ijk} - \bar{y}_{ij.})^2 = \sum_{i=1}^{a}\sum_{j=1}^{b} (n_{ij} - 1)s_{ij}^2$.

15.9 Show that $H_0\colon B\mu = 0$, where $B$ is given in (15.27), is equivalent to $H_0\colon \mu_{11} + \mu_{21} = \mu_{12} + \mu_{22} = \mu_{13} + \mu_{23}$ in (15.28).

15.10 Obtain $F = (\sum_{ij} a_{ij}\bar{y}_{ij.})^2/(s^2\sum_{ij} a_{ij}^2/n_{ij})$ in (15.31) from $F = (a'\hat{\mu})^2/[s^2a'(W'W)^{-1}a]$ in (15.30).

15.11 Evaluate $a'(W'W)^{-1}a$ in (15.29) or (15.30) for $a' = (1, 1, 1, -1, -1, -1)$. Use the $W$ matrix for the 11 observations in the illustration in Section 15.3.1.

15.12 Evaluate $B(W'W)^{-1}B'$ in (15.33) for the matrices $B$ and $W$ used in the illustration in Section 15.3.1.

15.13 Show that $AA^{-1} = I$, where $A = \begin{pmatrix} K \\ G \end{pmatrix}$ as in (15.37) and $A^{-1} = [K'(KK')^{-1}, G'(GG')^{-1}]$ as in (15.41).

15.14 Obtain $G^*$ analogous to $K^*$ in (15.42).

15.15 Show that $\mathrm{cov}(\hat{\mu}_c) = \sigma^2K'(KK')^{-1}(Z_1'Z_1)^{-1}(KK')^{-1}K$, thus verifying (15.44).

15.16 Show that the numerator of the $F$ statistic in (15.54) is invariant to the choice of a generalized inverse.

15.17 In a feeding trial, chicks were given five protein supplements. Their final weights at 6 weeks are given in Table 15.10 (Snedecor 1948, p. 214).
      (a) Calculate the sums of squares in Table 15.1 and the $F$ statistic in (15.11).
      (b) Compare the protein supplements using (unweighted) orthogonal contrasts whose coefficients are the rows in the matrix

\[
\begin{pmatrix}
3 & -2 & -2 & -2 & 3 \\
0 & 1 & -2 & 1 & 0 \\
0 & 1 & 0 & -1 & 0 \\
1 & 0 & 0 & 0 & -1
\end{pmatrix}.
\]

          Thus we are making the following comparisons:

              L, C   versus  So, Su, M
              So, M  versus  Su
              So     versus  M
              L      versus  C

      (c) Replace the second contrast with a weighted contrast whose coefficients satisfy $\sum_i a_ib_i/n_i = 0$ when paired with the first contrast. Find sums of squares and $F$ statistics for these two contrasts.

TABLE 15.10 Final Weights (g) of Chicks at 6 Weeks

                 Protein Supplement
Linseed   Soybean   Sunflower   Meat   Casein
309       243       423         325    368
229       230       340         257    390
181       248       392         303    379
141       327       339         315    260
260       329       341         380    404
203       250       226         153    318
148       193       320         263    352
169       271       295         242    359
213       316       334         206    216
257       267       322         344    222
244       199       297         258    283
271       177       318                332
          158
          248

15.18 (a) Carry out the computations to obtain $\hat{\mu}$, SSE, $F_A$, $F_B$, and $F_C$ in Example 15.3a.
      (b) Carry out the computations to obtain $\hat{\mu}_c$, $\mathrm{SSE}_c$, $F_{A_c}$, and $F_{B_c}$ in Example 15.3b.
      (c) Carry out the tests in parts (a) and (b) using a software package such as SAS GLM.

15.19 Table 15.11 lists weight gains of male rats under three types of feed and two levels of protein.
      (a) Let factor A be level of protein and factor B be type of feed. Define a vector $a$ corresponding to factor A and matrices $B$ and $C$ for factor B and interaction AB, respectively, as in Section 15.3.1. Use these to construct general linear hypothesis tests for main effects and interaction as in (15.29), (15.33), and (15.34).
      (b) Test the main effects in the no-interaction model (15.35) using the constrained model (15.36). Define $K$ and $G$ and find $\hat{\mu}_c$ in (15.43), $\mathrm{SSE}_c$, and $F$ for $H_0\colon a'\mu_c = 0$ and $H_0\colon B\mu_c = 0$ in (15.45).
      (c) Carry out the tests in parts (a) and (b) using a software package such as SAS GLM.

15.20 Table 15.12 lists yields when five varieties of plants and four fertilizers were tested. Test for main effects and interaction.




TABLE 15.11 Weight Gains (g) of Rats under 
Six Diet Combinations 


High Protein Low Protein 


Beef Cereal Pork Beef Cereal Pork 


73 98 94 90 107 49 
102 74 79 76 95 82 
118 56 96 90 97 73 
104 111 98 64 80 86 
81 95 102 86 98 81 
107 88 102 51 74 97 
100 82 72 106 
87 77 90 

86 95 

92 78 


Source: Snedecor and Cochran (1967, p. 347). 


TABLE 15.12 Yield from Five Varieties of Plants 
Treated with Four Fertilizers 


Variety 
Fertilizer 1 2 3 4 5 
1 57 26 39 23 48 
46 38 — 36 35 
— 20 — 18 — 
2 67 44 57 74 61 
72 68 61 47 — 
66 64 — 69 — 
3 95 92 91 98 78 
90 89 82 85 89 
89 95 
4 92 96 98 99 99 
88 95 93 90 — 
— — 98 98 — 


Source: Ostle and Mensing (1975, p. 368). 


16 Analysis-of-Covariance 


16.1 INTRODUCTION 


In addition to the dependent variable y, there may be one or more quantitative 
variables that can also be measured on each experimental unit (or subject) in an 
ANOVA situation. If it appears that these extra variables may affect the outcome 
of the experiment, they can be included in the model as independent variables (x’s) 
and are then known as covariates or concomitant variables. Analysis of covariance 
is sometimes described as a blend of ANOVA and regression. 

The primary motivation for the use of covariates in an experiment is to gain 
precision by reducing the error variance. In some situations, analysis of covariance 
can be used to lessen the effect of factors that the experimenter cannot effectively 
control, because an attempt to include various levels of a quantitative variable as a 
full factor may cause the design to become unwieldy. In such cases, the variable 
can be included as a covariate, with a resulting adjustment to the dependent variable 
before comparing means of groups. Variables of this type may also occur in exper- 
imental situations in which the subjects cannot be randomly assigned to treatments. 
In such cases, we forfeit the causality implication of a designed experiment, and 
analysis of covariance is closer in spirit to descriptive model building. 

In terms of a one-way model with one covariate, analysis of covariance will be 
successful if the following three assumptions hold. 


1. The dependent variable is linearly related to the covariate. If this assumption holds, part of the error in the model is predictable and can be removed to reduce the error variance. This assumption can be checked by testing $H_0\colon \beta = 0$, where $\beta$ is the slope from the regression of the dependent variable on the covariate. Since the estimated slope $\hat{\beta}$ will never be exactly 0, analysis of covariance will always give a smaller sum of squares for error than the corresponding ANOVA. However, if $\hat{\beta}$ is close to 0, the small reduction in error sum of squares may not offset the loss of a degree of freedom [see (16.27) and a comment following]. This problem is more likely to arise with multiple covariates, especially if they are highly correlated.




2. The groups (treatments) have the same slope. In assumption 1 above, a common slope $\beta$ for all $k$ groups is implied (assuming a one-way model with $k$ groups). We can check this assumption by testing $H_0\colon \beta_1 = \beta_2 = \cdots = \beta_k$, where $\beta_i$ is the slope in the $i$th group.

3. The covariate does not affect the differences among the means of the groups 
(treatments). If differences among the group means were reduced when the 
dependent variable is adjusted for the covariate, the test for equality of group 
means would be less powerful. Assumption 3 can be checked by performing 
an ANOVA on the covariate. 


Covariates can be either fixed constants (values chosen by the researcher) or 
random variables. The models we consider in this chapter involve fixed covariates, 
but in practice, the majority are random. However, the estimation and testing pro- 
cedures are the same in both cases, although the properties of estimators and tests 
are somewhat different for fixed and random covariates. For example, in the fixed- 
covariate case, the power of the test depends on the actual values chosen for the cov- 
ariates, whereas in the random-covariate case, the power of the test depends on the 
population covariance matrix of the covariates. 

As an illustration of the use of analysis of covariance, suppose that we wish to 
compare three methods of teaching language. Three classes are available, and we 
assign a class to each of the teaching methods. The students are free to sign up for 
any one of the three classes and are therefore not randomly assigned. One of the 
classes may end up with a disproportionate share of the best students, in which case 
we cannot claim that teaching methods have produced a significant difference in 
final grades. However, we can use previous grades or other measures of performance 
as covariates and then compare the students’ adjusted scores for the three methods. 

We give a general approach to estimation and testing in Section 16.2 and then cover specific balanced models in Sections 16.3-16.5. Unbalanced models are discussed briefly in Section 16.6. We use overparameterized models for the balanced case in Sections 16.2-16.5 and use the cell means model in Section 16.6.


16.2 ESTIMATION AND TESTING 


We introduce and illustrate the analysis of covariance model in Section 16.2.1 and 
discuss estimation and testing for this model in Sections 16.2.2 and 16.2.3. 
16.2.1 The Analysis-of-Covariance Model 
In general, an analysis of covariance model can be written as

\[
y = Z\alpha + X\beta + \varepsilon, \tag{16.1}
\]

where $Z$ contains 0s and 1s; $\alpha$ contains $\mu$ and parameters such as $\alpha_i$, $\beta_j$, and $\gamma_{ij}$ representing factors and interactions (or other effects); $X$ contains the covariate values; and $\beta$ contains coefficients of the covariates. Thus the covariates appear on the right side of (16.1) as independent variables. Note that $Z\alpha$ is the same as $X\beta$ in the ANOVA models in Chapters 12-14, whereas in this chapter, we use $X\beta$ to represent the covariates in the model.

We now illustrate (16.1) for some of the models that will be considered in this chapter. A one-way (balanced) model with one covariate can be expressed as

\[
y_{ij} = \mu + \alpha_i + \beta x_{ij} + \varepsilon_{ij}, \qquad i = 1, 2, \ldots, k, \quad j = 1, 2, \ldots, n, \tag{16.2}
\]

where $\alpha_i$ is the treatment effect, $x_{ij}$ is a covariate observed on the same sampling unit as $y_{ij}$, and $\beta$ is a slope relating $x_{ij}$ to $y_{ij}$. [If (16.2) is viewed as a regression model, then the parameters $\mu + \alpha_i$, $i = 1, 2, \ldots, k$, serve as regression intercepts for the $k$ groups.] The $kn$ observations for (16.2) can be written in the form $y = Z\alpha + X\beta + \varepsilon$ as in (16.1), where

\[
Z = \begin{pmatrix}
1 & 1 & 0 & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots \\
1 & 1 & 0 & \cdots & 0 \\
1 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots \\
1 & 0 & 0 & \cdots & 1
\end{pmatrix}, \qquad
\alpha = \begin{pmatrix} \mu \\ \alpha_1 \\ \vdots \\ \alpha_k \end{pmatrix}, \qquad
X = x = \begin{pmatrix} x_{11} \\ \vdots \\ x_{1n} \\ x_{21} \\ \vdots \\ x_{kn} \end{pmatrix}, \tag{16.3}
\]

and the vector $\beta$ reduces to the scalar slope $\beta$. In this case, $Z$ is the same as $X$ in (13.6).
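
The pattern of $Z$ in (16.3) is a column of 1s followed by group indicator columns, which can be generated with a Kronecker product. The following NumPy sketch is illustrative; the function name and the covariate values are hypothetical.

```python
import numpy as np

def oneway_ancova_design(k, n):
    """Z for a balanced one-way model: a column of 1s followed by k indicator columns."""
    intercept = np.ones((k * n, 1))
    indicators = np.kron(np.eye(k), np.ones((n, 1)))   # block of n ones per group
    return np.hstack([intercept, indicators])

k, n = 3, 4
Z = oneway_ancova_design(k, n)
x = np.arange(k * n, dtype=float)      # hypothetical covariate values
U = np.column_stack([Z, x])            # U = (Z, X) with a single covariate
```

Here `Z` has $k + 1$ columns but rank $k$, reflecting the overparameterization, while appending a covariate that varies within groups raises the rank by one.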
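For a one-way (balanced) model with $q$ covariates, the model is

\[
y_{ij} = \mu + \alpha_i + \beta_1x_{ij1} + \cdots + \beta_qx_{ijq} + \varepsilon_{ij}, \qquad i = 1, 2, \ldots, k, \quad j = 1, 2, \ldots, n. \tag{16.4}
\]

In this case, $Z$ and $\alpha$ are as given in (16.3), and $X\beta$ has the form

\[
X\beta = \begin{pmatrix}
x_{111} & x_{112} & \cdots & x_{11q} \\
x_{121} & x_{122} & \cdots & x_{12q} \\
\vdots & \vdots & & \vdots \\
x_{kn1} & x_{kn2} & \cdots & x_{knq}
\end{pmatrix}
\begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_q \end{pmatrix}. \tag{16.5}
\]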

For a two-way model with one covariate 

Yijk = + aj + 6 + Vij + Pxijx + Eijk, (16.6) 


$Z\alpha$ has the form given in (14.4), and $X\beta$ is

\[
X\beta = x\beta = \begin{pmatrix} x_{111} \\ x_{112} \\ \vdots \\ x_{abn} \end{pmatrix}\beta.
\]

The two-way model in (16.6) could be extended to include several covariates.




16.2.2 Estimation 


We now develop estimators of @ and £ for the general case in (16.1), 
y = Za+XB+ e. We assume that Z is less than full rank as in overparameterized 
ANOVA models and that X is full-rank as in regression models. We also assume that 


E(eé)=0 and cov(e) = ol. 


The model can be expressed as 


y=Za+XpB+e 
@.x( 4) + 
= (Z, E 
B 


= U0+e, 


where U = (Z, X) and 0= G 


. The normal equations for (16.7) are 


UU6=U, 


which can be written in partitioned form as 


Z Qa Z' 
(x) (5) =e) 
ZZ UX\(é& Zy 
& xx) (a) 7 xe)" 


We can express (16.8) as two sets of equations in @ and B: 


ZZa+ZXp=Z'y, 


X'Za@ + X'XB = X’y. 
Using a generalized inverse of Z'Z, we can solve for @& in (16.9): 
& = (Z'Z) Z'y — (Z'Z) ZXB 
= & — (Z'Z)-Z'XB, 


(16.7) 


(16.8) 


(16.9) 


(16.10) 


(16.11) 


where @ = (Z'Z) Z’y is a solution for the normal equations for the model 


y = Za + é€ without the covariates [see (12.13)]. 


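To solve for $\hat{\boldsymbol{\beta}}$, we substitute (16.11) into (16.10) to obtain

$$
\mathbf{X}'\mathbf{Z}\left[(\mathbf{Z}'\mathbf{Z})^{-}\mathbf{Z}'\mathbf{y} - (\mathbf{Z}'\mathbf{Z})^{-}\mathbf{Z}'\mathbf{X}\hat{\boldsymbol{\beta}}\right] + \mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}'\mathbf{y}
$$

or

$$
\mathbf{X}'\mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-}\mathbf{Z}'\mathbf{y} + \mathbf{X}'\left[\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-}\mathbf{Z}'\right]\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}'\mathbf{y}. \tag{16.12}
$$

Defining

$$
\mathbf{P} = \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-}\mathbf{Z}', \tag{16.13}
$$

we see that (16.12) becomes

$$
\mathbf{X}'(\mathbf{I} - \mathbf{P})\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}'\mathbf{y} - \mathbf{X}'\mathbf{P}\mathbf{y} = \mathbf{X}'(\mathbf{I} - \mathbf{P})\mathbf{y}.
$$

Since the elements of $\mathbf{X}$ typically exhibit a pattern unrelated to the 0s and 1s in $\mathbf{Z}$, we can assume that the columns of $\mathbf{X}$ are linearly independent of the columns of $\mathbf{Z}$. Then $\mathbf{X}'(\mathbf{I} - \mathbf{P})\mathbf{X}$ is nonsingular (see Problem 16.1), and a solution for $\hat{\boldsymbol{\beta}}$ is given by

$$
\hat{\boldsymbol{\beta}} = [\mathbf{X}'(\mathbf{I} - \mathbf{P})\mathbf{X}]^{-1}\mathbf{X}'(\mathbf{I} - \mathbf{P})\mathbf{y} \tag{16.14}
$$

$$
\phantom{\hat{\boldsymbol{\beta}}} = \mathbf{E}_{xx}^{-1}\mathbf{e}_{xy}, \tag{16.15}
$$

where

$$
\mathbf{E}_{xx} = \mathbf{X}'(\mathbf{I} - \mathbf{P})\mathbf{X} \quad \text{and} \quad \mathbf{e}_{xy} = \mathbf{X}'(\mathbf{I} - \mathbf{P})\mathbf{y}. \tag{16.16}
$$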


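For the analysis-of-covariance model (16.1) or (16.7), we denote SSE as $\mathrm{SSE}_{y\cdot x}$. By (12.20), $\mathrm{SSE}_{y\cdot x}$ can be expressed as

$$
\begin{aligned}
\mathrm{SSE}_{y\cdot x} &= \mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\theta}}'\mathbf{U}'\mathbf{y}
= \mathbf{y}'\mathbf{y} - (\hat{\boldsymbol{\alpha}}', \hat{\boldsymbol{\beta}}') \begin{pmatrix} \mathbf{Z}'\mathbf{y} \\ \mathbf{X}'\mathbf{y} \end{pmatrix} \\
&= \mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\alpha}}'\mathbf{Z}'\mathbf{y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} \\
&= \mathbf{y}'\mathbf{y} - \left[\hat{\boldsymbol{\alpha}}_0' - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-}\right]\mathbf{Z}'\mathbf{y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} \qquad [\text{by } (16.11)] \\
&= \mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\alpha}}_0'\mathbf{Z}'\mathbf{y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\left[\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-}\mathbf{Z}'\right]\mathbf{y} \\
&= \mathrm{SSE}_y - \hat{\boldsymbol{\beta}}'\mathbf{X}'(\mathbf{I} - \mathbf{P})\mathbf{y},
\end{aligned} \tag{16.17}
$$

where $\hat{\boldsymbol{\alpha}}_0$ is as defined in (16.11), $\mathbf{P}$ is defined as in (16.13), and $\mathrm{SSE}_y = \mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\alpha}}_0'\mathbf{Z}'\mathbf{y}$ is the same as the SSE for the ANOVA model $\mathbf{y} = \mathbf{Z}\boldsymbol{\alpha} + \boldsymbol{\varepsilon}$ without the covariates. Using (16.16), we can write (16.17) in the form

$$
\mathrm{SSE}_{y\cdot x} = e_{yy} - \mathbf{e}_{xy}'\mathbf{E}_{xx}^{-1}\mathbf{e}_{xy}, \tag{16.18}
$$

where

$$
e_{yy} = \mathrm{SSE}_y = \mathbf{y}'(\mathbf{I} - \mathbf{P})\mathbf{y}. \tag{16.19}
$$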




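In (16.18), we see the reduction in SSE that was noted in the second paragraph of Section 16.1. The proof that $\mathbf{E}_{xx} = \mathbf{X}'(\mathbf{I} - \mathbf{P})\mathbf{X}$ is nonsingular (see Problem 16.1) can be extended to show that $\mathbf{E}_{xx}$ is positive definite. Therefore, $\mathbf{e}_{xy}'\mathbf{E}_{xx}^{-1}\mathbf{e}_{xy} > 0$, and $\mathrm{SSE}_{y\cdot x} < \mathrm{SSE}_y$.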
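The estimation steps in (16.11) through (16.19) can be sketched numerically. The following is a minimal illustration, not the authors' code: the function name `ancova_fit`, the simulated data, and the use of NumPy's Moore-Penrose inverse as the generalized inverse of Z'Z are all assumptions made for the example.

```python
import numpy as np

def ancova_fit(Z, X, y):
    """Estimate beta and alpha for y = Z alpha + X beta + e via (16.11)-(16.19)."""
    Z = np.asarray(Z, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float).reshape(len(y), -1)
    ZtZ_ginv = np.linalg.pinv(Z.T @ Z)       # one valid generalized inverse of Z'Z
    P = Z @ ZtZ_ginv @ Z.T                   # (16.13); P does not depend on that choice
    M = np.eye(len(y)) - P                   # I - P
    E_xx = X.T @ M @ X                       # (16.16)
    e_xy = X.T @ M @ y                       # (16.16)
    e_yy = y @ M @ y                         # (16.19), equal to SSE_y
    beta = np.linalg.solve(E_xx, e_xy)       # (16.15)
    alpha = ZtZ_ginv @ Z.T @ (y - X @ beta)  # (16.11): alpha_0 - (Z'Z)^- Z'X beta
    sse_yx = e_yy - e_xy @ beta              # (16.18)
    return beta, alpha, e_yy, sse_yx

# Hypothetical balanced one-way layout: k = 3 groups, n = 4 observations, one covariate
rng = np.random.default_rng(0)
k, n = 3, 4
Z = np.hstack([np.ones((k * n, 1)), np.kron(np.eye(k), np.ones((n, 1)))])
x = rng.normal(size=k * n)
y = 2.0 + np.repeat([0.0, 1.0, -1.0], n) + 0.5 * x + rng.normal(scale=0.1, size=k * n)

beta, alpha, sse_y, sse_yx = ancova_fit(Z, x, y)
print(beta, sse_y, sse_yx)
```

A different generalized inverse would give a different solution for alpha, but the estimate of beta, the fitted values, and both error sums of squares are unchanged.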


16.2.3 Testing Hypotheses 


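In order to test hypotheses, we assume that $\boldsymbol{\varepsilon}$ in (16.1) is distributed as $N_n(\mathbf{0}, \sigma^2\mathbf{I})$, where $n$ is the number of rows of $\mathbf{Z}$ or $\mathbf{X}$. Using the model (16.7), we can express a hypothesis about $\boldsymbol{\alpha}$ in the form $H_0\colon \mathbf{C}\boldsymbol{\theta} = \mathbf{0}$, where $\mathbf{C} = (\mathbf{C}_1, \mathbf{O})$, so that $H_0$ becomes

$$
H_0\colon (\mathbf{C}_1, \mathbf{O}) \begin{pmatrix} \boldsymbol{\alpha} \\ \boldsymbol{\beta} \end{pmatrix} = \mathbf{0} \quad \text{or} \quad H_0\colon \mathbf{C}_1\boldsymbol{\alpha} = \mathbf{0}.
$$

We can then use a general linear hypothesis test. Alternatively, we can incorporate the hypothesis into the model and use a full-reduced-model approach.

Hypotheses about $\boldsymbol{\beta}$ can also be expressed in the form $H_0\colon \mathbf{C}\boldsymbol{\theta} = \mathbf{0}$:

$$
H_0\colon \mathbf{C}\boldsymbol{\theta} = (\mathbf{O}, \mathbf{C}_2) \begin{pmatrix} \boldsymbol{\alpha} \\ \boldsymbol{\beta} \end{pmatrix} = \mathbf{0} \quad \text{or} \quad H_0\colon \mathbf{C}_2\boldsymbol{\beta} = \mathbf{0}.
$$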


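A basic hypothesis of interest is $H_0\colon \boldsymbol{\beta} = \mathbf{0}$, that is, that the covariate(s) do not belong in the model (16.1). In order to make a general linear hypothesis test of $H_0\colon \boldsymbol{\beta} = \mathbf{0}$, we need $\operatorname{cov}(\hat{\boldsymbol{\beta}})$, where $\hat{\boldsymbol{\beta}}$ is given by (16.14) as $\hat{\boldsymbol{\beta}} = [\mathbf{X}'(\mathbf{I} - \mathbf{P})\mathbf{X}]^{-1}\mathbf{X}'(\mathbf{I} - \mathbf{P})\mathbf{y}$. Since $\mathbf{I} - \mathbf{P}$ is idempotent (see Theorems 2.13e and 2.13f), $\operatorname{cov}(\hat{\boldsymbol{\beta}})$ can readily be found from (3.44) as

$$
\begin{aligned}
\operatorname{cov}(\hat{\boldsymbol{\beta}}) &= [\mathbf{X}'(\mathbf{I} - \mathbf{P})\mathbf{X}]^{-1}\mathbf{X}'(\mathbf{I} - \mathbf{P})\,\sigma^2\mathbf{I}\,(\mathbf{I} - \mathbf{P})\mathbf{X}[\mathbf{X}'(\mathbf{I} - \mathbf{P})\mathbf{X}]^{-1} \\
&= \sigma^2[\mathbf{X}'(\mathbf{I} - \mathbf{P})\mathbf{X}]^{-1}.
\end{aligned} \tag{16.20}
$$

Then SSH for testing $H_0\colon \boldsymbol{\beta} = \mathbf{0}$ is given by Theorem 8.4a(ii) as

$$
\mathrm{SSH} = \hat{\boldsymbol{\beta}}'\mathbf{X}'(\mathbf{I} - \mathbf{P})\mathbf{X}\hat{\boldsymbol{\beta}}. \tag{16.21}
$$

Using (16.16), we can express this as

$$
\mathrm{SSH} = \mathbf{e}_{xy}'\mathbf{E}_{xx}^{-1}\mathbf{e}_{xy}. \tag{16.22}
$$

Note that SSH in (16.22) is equal to the reduction in SSE due to the covariates; see (16.17), (16.18), and (16.19).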




We now discuss some specific models, beginning with the one-way model in 
Section 16.3. 


16.3 ONE-WAY MODEL WITH ONE COVARIATE


We review the one-way model in Section 16.3.1, consider estimators of parameters in 
Section 16.3.2, and discuss tests of hypotheses in Section 16.3.3. 


16.3.1 The Model 


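The one-way (balanced) model was introduced in (16.2):

$$
y_{ij} = \mu + \alpha_i + \beta x_{ij} + \varepsilon_{ij}, \quad i = 1, 2, \ldots, k, \quad j = 1, 2, \ldots, n. \tag{16.23}
$$

All $kn$ observations can be written in the form of (16.1)

$$
\mathbf{y} = \mathbf{Z}\boldsymbol{\alpha} + \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} = \mathbf{Z}\boldsymbol{\alpha} + \mathbf{x}\beta + \boldsymbol{\varepsilon},
$$

where $\mathbf{Z}$, $\boldsymbol{\alpha}$, and $\mathbf{x}$ are as given in (16.3).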


16.3.2 Estimation 
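By (16.11), (13.11), and (13.12), an estimator of $\boldsymbol{\alpha}$ is obtained as

$$
\hat{\boldsymbol{\alpha}} = \hat{\boldsymbol{\alpha}}_0 - (\mathbf{Z}'\mathbf{Z})^{-}\mathbf{Z}'\mathbf{X}\hat{\boldsymbol{\beta}} = \hat{\boldsymbol{\alpha}}_0 - (\mathbf{Z}'\mathbf{Z})^{-}\mathbf{Z}'\mathbf{x}\hat{\beta}
= \begin{pmatrix} 0 \\ \bar{y}_{1.} \\ \bar{y}_{2.} \\ \vdots \\ \bar{y}_{k.} \end{pmatrix}
- \begin{pmatrix} 0 \\ \hat{\beta}\bar{x}_{1.} \\ \hat{\beta}\bar{x}_{2.} \\ \vdots \\ \hat{\beta}\bar{x}_{k.} \end{pmatrix}
= \begin{pmatrix} 0 \\ \bar{y}_{1.} - \hat{\beta}\bar{x}_{1.} \\ \bar{y}_{2.} - \hat{\beta}\bar{x}_{2.} \\ \vdots \\ \bar{y}_{k.} - \hat{\beta}\bar{x}_{k.} \end{pmatrix} \tag{16.24}
$$

(see Problem 16.4). In this case, with a single $x$, $\mathbf{E}_{xx}$ and $\mathbf{e}_{xy}$ reduce to scalars, along with $e_{yy}$:

$$
\begin{aligned}
\mathbf{E}_{xx} &= e_{xx} = \sum_{i=1}^{k}\sum_{j=1}^{n} (x_{ij} - \bar{x}_{i.})^2, \\
\mathbf{e}_{xy} &= e_{xy} = \sum_{ij} (x_{ij} - \bar{x}_{i.})(y_{ij} - \bar{y}_{i.}), \\
e_{yy} &= \sum_{ij} (y_{ij} - \bar{y}_{i.})^2.
\end{aligned} \tag{16.25}
$$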



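Now, by (16.15), the estimator of $\beta$ is

$$
\hat{\beta} = \frac{e_{xy}}{e_{xx}} = \frac{\sum_{ij} (x_{ij} - \bar{x}_{i.})(y_{ij} - \bar{y}_{i.})}{\sum_{ij} (x_{ij} - \bar{x}_{i.})^2}. \tag{16.26}
$$

By (16.18), (16.19), and the three results in (16.25), $\mathrm{SSE}_{y\cdot x}$ is given by

$$
\begin{aligned}
\mathrm{SSE}_{y\cdot x} &= e_{yy} - \mathbf{e}_{xy}'\mathbf{E}_{xx}^{-1}\mathbf{e}_{xy} = e_{yy} - \frac{e_{xy}^2}{e_{xx}} \\
&= \sum_{ij} (y_{ij} - \bar{y}_{i.})^2 - \frac{\left[\sum_{ij} (x_{ij} - \bar{x}_{i.})(y_{ij} - \bar{y}_{i.})\right]^2}{\sum_{ij} (x_{ij} - \bar{x}_{i.})^2},
\end{aligned} \tag{16.27}
$$

which has $k(n - 1) - 1$ degrees of freedom. Note that the degrees of freedom of $\mathrm{SSE}_{y\cdot x}$ are reduced by 1 for estimation of $\beta$, since $\mathrm{SSE}_y = e_{yy}$ has $k(n - 1)$ degrees of freedom and $e_{xy}^2/e_{xx}$ has 1 degree of freedom. In using analysis of covariance, the researcher expects the reduction from $\mathrm{SSE}_y$ to $\mathrm{SSE}_{y\cdot x}$ to at least offset the loss of a degree of freedom.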


16.3.3 Testing Hypotheses 


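For testing hypotheses, we assume that the $\varepsilon_{ij}$'s in (16.23) are independently distributed as $N(0, \sigma^2)$. We begin with a test for equality of treatment effects.


16.3.3.1 Treatments
To test

$$
H_{01}\colon \alpha_1 = \alpha_2 = \cdots = \alpha_k
$$

adjusted for the covariate, we use a full-reduced-model approach. The full model is (16.23), and the reduced model (with $\alpha_1 = \alpha_2 = \cdots = \alpha_k = \alpha$) is

$$
\begin{aligned}
y_{ij} &= \mu + \alpha + \beta x_{ij} + \varepsilon_{ij} \\
&= \mu^* + \beta x_{ij} + \varepsilon_{ij}, \quad i = 1, 2, \ldots, k, \quad j = 1, 2, \ldots, n.
\end{aligned} \tag{16.28}
$$

This is essentially the same as the simple linear regression model (6.1). By (6.13), SSE for this reduced model (denoted by $\mathrm{SSE}_{rd}$) is given by

$$
\mathrm{SSE}_{rd} = \sum_{i=1}^{k}\sum_{j=1}^{n} (y_{ij} - \bar{y}_{..})^2 - \frac{\left[\sum_{ij} (x_{ij} - \bar{x}_{..})(y_{ij} - \bar{y}_{..})\right]^2}{\sum_{ij} (x_{ij} - \bar{x}_{..})^2}, \tag{16.29}
$$

which has $kn - 1 - 1 = kn - 2$ degrees of freedom.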




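Using a notation adapted from Sections 8.2, 13.4, and 14.4, we express the sum of squares for testing $H_{01}$ as

$$
\mathrm{SS}(\alpha \mid \mu, \beta) = \mathrm{SS}(\mu, \alpha, \beta) - \mathrm{SS}(\mu, \beta).
$$

In (16.27), $\mathrm{SSE}_{y\cdot x}$ is for the full model, and in (16.29), $\mathrm{SSE}_{rd}$ is for the reduced model. They can therefore be written as $\mathrm{SSE}_{y\cdot x} = \mathbf{y}'\mathbf{y} - \mathrm{SS}(\mu, \alpha, \beta)$ and $\mathrm{SSE}_{rd} = \mathbf{y}'\mathbf{y} - \mathrm{SS}(\mu, \beta)$. Hence

$$
\mathrm{SS}(\alpha \mid \mu, \beta) = \mathrm{SSE}_{rd} - \mathrm{SSE}_{y\cdot x}, \tag{16.30}
$$

which has $kn - 2 - [k(n - 1) - 1] = k - 1$ degrees of freedom. The test statistic for $H_{01}\colon \alpha_1 = \alpha_2 = \cdots = \alpha_k$ is therefore given by

$$
F = \frac{\mathrm{SS}(\alpha \mid \mu, \beta)/(k - 1)}{\mathrm{SSE}_{y\cdot x}/[k(n - 1) - 1]}, \tag{16.31}
$$

which is distributed as $F[k - 1, k(n - 1) - 1]$ when $H_{01}$ is true.

By (16.30), we have

$$
\mathrm{SSE}_{rd} = \mathrm{SS}(\alpha \mid \mu, \beta) + \mathrm{SSE}_{y\cdot x}.
$$

Hence, $\mathrm{SSE}_{rd}$ functions as the "total sum of squares" for the test of treatment effects adjusted for the covariate. We can therefore denote $\mathrm{SSE}_{rd}$ by $\mathrm{SST}_{y\cdot x}$, so that the expression above becomes

$$
\mathrm{SST}_{y\cdot x} = \mathrm{SS}(\alpha \mid \mu, \beta) + \mathrm{SSE}_{y\cdot x}. \tag{16.32}
$$

To complete the analogy with $\mathrm{SSE}_{y\cdot x} = e_{yy} - e_{xy}^2/e_{xx}$ in (16.27), we write (16.29) as

$$
\mathrm{SST}_{y\cdot x} = t_{yy} - \frac{t_{xy}^2}{t_{xx}}, \tag{16.33}
$$

where

$$
\mathrm{SST}_{y\cdot x} = \mathrm{SSE}_{rd}, \quad t_{yy} = \sum_{ij} (y_{ij} - \bar{y}_{..})^2, \quad t_{xy} = \sum_{ij} (x_{ij} - \bar{x}_{..})(y_{ij} - \bar{y}_{..}), \quad t_{xx} = \sum_{ij} (x_{ij} - \bar{x}_{..})^2. \tag{16.34}
$$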

Note that the procedure used to obtain (16.30) is fundamentally different from that used to obtain $\mathrm{SSE}_{y\cdot x}$ and $\mathrm{SSE}_{rd}$ in (16.27) and (16.29). The sum of squares $\mathrm{SS}(\alpha \mid \mu, \beta)$ in (16.30) is obtained as the difference between the sums of squares for full and reduced models, not as an adjustment to $\mathrm{SS}(\alpha \mid \mu) = n\sum_i (\bar{y}_{i.} - \bar{y}_{..})^2$ in (13.24) analogous to the adjustment used in $\mathrm{SSE}_{y\cdot x}$ and $\mathrm{SST}_{y\cdot x}$ in (16.27) and (16.33). We must use the full-reduced-model approach to compute $\mathrm{SS}(\alpha \mid \mu, \beta)$, because we do not have the same covariate values for each treatment and the design is therefore unbalanced (even though the $n$ values are equal). If $\mathrm{SS}(\alpha \mid \mu, \beta)$ were computed in an "adjusted" manner as in (16.27) or (16.33), then $\mathrm{SS}(\alpha \mid \mu, \beta) + \mathrm{SSE}_{y\cdot x}$ would not equal $\mathrm{SST}_{y\cdot x}$ as in (16.32). In Section 16.4, we will follow a computational scheme similar to that of (16.30) and (16.32) for each term in the two-way (balanced) model.

We display the various sums of squares for testing $H_{01}\colon \alpha_1 = \alpha_2 = \cdots = \alpha_k$ in Table 16.1.


16.3.3.2 Slope 
We now consider a test for

$$
H_{02}\colon \beta = 0.
$$

By (16.22), the general linear hypothesis approach leads to $\mathrm{SSH} = \mathbf{e}_{xy}'\mathbf{E}_{xx}^{-1}\mathbf{e}_{xy}$ for testing $H_0\colon \boldsymbol{\beta} = \mathbf{0}$. For the case of a single covariate, this reduces to

$$
\mathrm{SSH} = \frac{e_{xy}^2}{e_{xx}}, \tag{16.35}
$$

where $e_{xy}$ and $e_{xx}$ are as found in (16.25). The $F$ statistic is therefore given by

$$
F = \frac{e_{xy}^2/e_{xx}}{\mathrm{SSE}_{y\cdot x}/[k(n - 1) - 1]}, \tag{16.36}
$$

which is distributed as $F[1, k(n - 1) - 1]$ when $H_{02}$ is true.


16.3.3.3 Homogeneity of Slopes 

The tests of $H_{01}\colon \alpha_1 = \alpha_2 = \cdots = \alpha_k$ and $H_{02}\colon \beta = 0$ assume a common slope for all $k$ groups. To check this assumption, we can test the hypothesis of equal slopes in the groups

$$
H_{03}\colon \beta_1 = \beta_2 = \cdots = \beta_k, \tag{16.37}
$$

where $\beta_i$ is the slope in the $i$th group. In effect, $H_{03}$ states that the $k$ regression lines are parallel.


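TABLE 16.1 Analysis of Covariance for Testing $H_{01}\colon \alpha_1 = \alpha_2 = \cdots = \alpha_k$ in the One-Way Model with One Covariate

Source        SS Adjusted for Covariate                                                  Adjusted df
Treatments    $\mathrm{SS}(\alpha \mid \mu, \beta) = \mathrm{SST}_{y\cdot x} - \mathrm{SSE}_{y\cdot x}$      $k - 1$
Error         $\mathrm{SSE}_{y\cdot x} = e_{yy} - e_{xy}^2/e_{xx}$                         $k(n - 1) - 1$
Total         $\mathrm{SST}_{y\cdot x} = t_{yy} - t_{xy}^2/t_{xx}$                         $kn - 2$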




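The full model allowing for different slopes becomes

$$
y_{ij} = \mu + \alpha_i + \beta_i x_{ij} + \varepsilon_{ij}, \quad i = 1, 2, \ldots, k, \quad j = 1, 2, \ldots, n. \tag{16.38}
$$

The reduced model with a single slope is (16.23). In matrix form, the $nk$ observations in (16.38) can be expressed as $\mathbf{y} = \mathbf{Z}\boldsymbol{\alpha} + \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$, where $\mathbf{Z}$ and $\boldsymbol{\alpha}$ are as given in (16.3) and

$$
\mathbf{X}\boldsymbol{\beta} = \begin{pmatrix}
\mathbf{x}_1 & \mathbf{0} & \cdots & \mathbf{0} \\
\mathbf{0} & \mathbf{x}_2 & \cdots & \mathbf{0} \\
\vdots & \vdots & & \vdots \\
\mathbf{0} & \mathbf{0} & \cdots & \mathbf{x}_k
\end{pmatrix}
\begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{pmatrix}, \tag{16.39}
$$

with $\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{in})'$. By (16.14) and (16.15), we obtain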
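$$
\hat{\boldsymbol{\beta}} = \mathbf{E}_{xx}^{-1}\mathbf{e}_{xy} = [\mathbf{X}'(\mathbf{I} - \mathbf{P})\mathbf{X}]^{-1}\mathbf{X}'(\mathbf{I} - \mathbf{P})\mathbf{y}.
$$

To evaluate $\mathbf{E}_{xx}$ and $\mathbf{e}_{xy}$, we first note that by (13.11), (13.25), and (13.26)

$$
\mathbf{I} - \mathbf{P} = \mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-}\mathbf{Z}' = \begin{pmatrix}
\mathbf{I} - \frac{1}{n}\mathbf{J} & \mathbf{O} & \cdots & \mathbf{O} \\
\mathbf{O} & \mathbf{I} - \frac{1}{n}\mathbf{J} & \cdots & \mathbf{O} \\
\vdots & \vdots & & \vdots \\
\mathbf{O} & \mathbf{O} & \cdots & \mathbf{I} - \frac{1}{n}\mathbf{J}
\end{pmatrix}, \tag{16.40}
$$

where $\mathbf{I}$ in $\mathbf{I} - \mathbf{P}$ is $kn \times kn$ and $\mathbf{I}$ in $\mathbf{I} - (1/n)\mathbf{J}$ is $n \times n$. Thus

$$
\begin{aligned}
\mathbf{E}_{xx} = \mathbf{X}'(\mathbf{I} - \mathbf{P})\mathbf{X} &= \begin{pmatrix}
\mathbf{x}_1'\left(\mathbf{I} - \frac{1}{n}\mathbf{J}\right)\mathbf{x}_1 & 0 & \cdots & 0 \\
0 & \mathbf{x}_2'\left(\mathbf{I} - \frac{1}{n}\mathbf{J}\right)\mathbf{x}_2 & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
0 & 0 & \cdots & \mathbf{x}_k'\left(\mathbf{I} - \frac{1}{n}\mathbf{J}\right)\mathbf{x}_k
\end{pmatrix} \\
&= \begin{pmatrix}
\sum_j (x_{1j} - \bar{x}_{1.})^2 & 0 & \cdots & 0 \\
0 & \sum_j (x_{2j} - \bar{x}_{2.})^2 & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
0 & 0 & \cdots & \sum_j (x_{kj} - \bar{x}_{k.})^2
\end{pmatrix}
\end{aligned} \tag{16.41}
$$

$$
\phantom{\mathbf{E}_{xx}} = \begin{pmatrix}
e_{xx,1} & 0 & \cdots & 0 \\
0 & e_{xx,2} & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
0 & 0 & \cdots & e_{xx,k}
\end{pmatrix}, \tag{16.42}
$$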


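where $e_{xx,i} = \sum_j (x_{ij} - \bar{x}_{i.})^2$. To find $\mathbf{e}_{xy}$, we partition $\mathbf{y}$ as $\mathbf{y} = (\mathbf{y}_1', \mathbf{y}_2', \ldots, \mathbf{y}_k')'$, where $\mathbf{y}_i' = (y_{i1}, y_{i2}, \ldots, y_{in})$. Then

$$
\begin{aligned}
\mathbf{e}_{xy} = \mathbf{X}'(\mathbf{I} - \mathbf{P})\mathbf{y}
&= \begin{pmatrix}
\mathbf{x}_1' & \mathbf{0}' & \cdots & \mathbf{0}' \\
\mathbf{0}' & \mathbf{x}_2' & \cdots & \mathbf{0}' \\
\vdots & \vdots & & \vdots \\
\mathbf{0}' & \mathbf{0}' & \cdots & \mathbf{x}_k'
\end{pmatrix}
\begin{pmatrix}
\mathbf{I} - \frac{1}{n}\mathbf{J} & \cdots & \mathbf{O} \\
\vdots & & \vdots \\
\mathbf{O} & \cdots & \mathbf{I} - \frac{1}{n}\mathbf{J}
\end{pmatrix}
\begin{pmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \\ \vdots \\ \mathbf{y}_k \end{pmatrix} \\
&= \begin{pmatrix}
\mathbf{x}_1'\left(\mathbf{I} - \frac{1}{n}\mathbf{J}\right)\mathbf{y}_1 \\
\vdots \\
\mathbf{x}_k'\left(\mathbf{I} - \frac{1}{n}\mathbf{J}\right)\mathbf{y}_k
\end{pmatrix}
= \begin{pmatrix}
\sum_j (x_{1j} - \bar{x}_{1.})(y_{1j} - \bar{y}_{1.}) \\
\sum_j (x_{2j} - \bar{x}_{2.})(y_{2j} - \bar{y}_{2.}) \\
\vdots \\
\sum_j (x_{kj} - \bar{x}_{k.})(y_{kj} - \bar{y}_{k.})
\end{pmatrix}
\end{aligned} \tag{16.43}
$$

$$
\phantom{\mathbf{e}_{xy}} = \begin{pmatrix} e_{xy,1} \\ e_{xy,2} \\ \vdots \\ e_{xy,k} \end{pmatrix}, \tag{16.44}
$$

where $e_{xy,i} = \sum_j (x_{ij} - \bar{x}_{i.})(y_{ij} - \bar{y}_{i.})$. Then, by (16.15), we obtain

$$
\hat{\boldsymbol{\beta}} = \mathbf{E}_{xx}^{-1}\mathbf{e}_{xy} = \begin{pmatrix} e_{xy,1}/e_{xx,1} \\ e_{xy,2}/e_{xx,2} \\ \vdots \\ e_{xy,k}/e_{xx,k} \end{pmatrix}. \tag{16.45}
$$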


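By analogy with (16.30), we obtain the sum of squares for the test of $H_{03}$ in (16.37) by subtracting $\mathrm{SSE}_{y\cdot x}$ for the full model from $\mathrm{SSE}_{y\cdot x}$ for the reduced model, that is, $\mathrm{SSE(R)}_{y\cdot x} - \mathrm{SSE(F)}_{y\cdot x}$. For the full model in (16.38), $\mathrm{SSE(F)}_{y\cdot x}$ is given by (16.18), (16.44), and (16.45) as

$$
\begin{aligned}
\mathrm{SSE(F)}_{y\cdot x} &= e_{yy} - \mathbf{e}_{xy}'\mathbf{E}_{xx}^{-1}\mathbf{e}_{xy} = e_{yy} - \mathbf{e}_{xy}'\hat{\boldsymbol{\beta}} \\
&= e_{yy} - (e_{xy,1}, e_{xy,2}, \ldots, e_{xy,k}) \begin{pmatrix} e_{xy,1}/e_{xx,1} \\ e_{xy,2}/e_{xx,2} \\ \vdots \\ e_{xy,k}/e_{xx,k} \end{pmatrix} \\
&= e_{yy} - \sum_{i=1}^{k} \frac{e_{xy,i}^2}{e_{xx,i}},
\end{aligned} \tag{16.46}
$$

which has $k(n - 1) - k = k(n - 2)$ degrees of freedom. The reduced model in which $H_{03}\colon \beta_1 = \beta_2 = \cdots = \beta_k = \beta$ is true is given by (16.23), for which $\mathrm{SSE(R)}_{y\cdot x}$ is found in (16.27) as

$$
\mathrm{SSE(R)}_{y\cdot x} = e_{yy} - \frac{e_{xy}^2}{e_{xx}}, \tag{16.47}
$$

which has $k(n - 1) - 1$ degrees of freedom. Thus, the sum of squares for testing $H_{03}$ is

$$
\mathrm{SSE(R)}_{y\cdot x} - \mathrm{SSE(F)}_{y\cdot x} = \sum_{i=1}^{k} \frac{e_{xy,i}^2}{e_{xx,i}} - \frac{e_{xy}^2}{e_{xx}}, \tag{16.48}
$$

which has $k(n - 1) - 1 - k(n - 2) = k - 1$ degrees of freedom. The test statistic is

$$
F = \frac{\left[\sum_{i=1}^{k} e_{xy,i}^2/e_{xx,i} - e_{xy}^2/e_{xx}\right]/(k - 1)}{\mathrm{SSE(F)}_{y\cdot x}/[k(n - 2)]}, \tag{16.49}
$$

which is distributed as $F[k - 1, k(n - 2)]$ when $H_{03}\colon \beta_1 = \beta_2 = \cdots = \beta_k$ is true.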


If the hypothesis of equal slopes is rejected, the hypothesis of equal treatment effects can still be tested, but interpretation is more difficult. The problem is somewhat analogous to that of interpretation of a main effect in a two-way ANOVA in the presence of interaction. In a sense, the term $\beta_i x_{ij}$ in (16.38) is an interaction. For further discussion of analysis of covariance with heterogeneity of slopes, see Reader (1973) and Hendrix et al. (1982).


TABLE 16.2 Maturation Weight and Initial Weight (mg) of Guppy Fish

                Feeding Group
      1             2             3
   y     x       y     x       y     x
  49    35      68    33      59    33
  61    26      70    35      53    36
  55    29      60    28      54    26
  69    32      53    29      48    30
  51    23      59    32      54    33
  38    26      48    23      53    25
  64    31      46    26      37    23


Example 16.3. To investigate the effect of diet on maturation weight of guppy fish (Poecilia reticulata), three groups of fish were fed different diets. The resulting weights y are given in Table 16.2 (Morrison 1983, p. 475) along with the initial weights x.


We first estimate $\beta$, using $x$ as a covariate. By the three results in (16.25), we have

$$
e_{xx} = 350.2857, \quad e_{xy} = 412.7143, \quad e_{yy} = 1465.7143.
$$

Then by (16.26), we obtain

$$
\hat{\beta} = \frac{e_{xy}}{e_{xx}} = \frac{412.7143}{350.2857} = 1.1782.
$$

We now test for equality of treatment means adjusted for the covariate, $H_0\colon \alpha_1 = \alpha_2 = \alpha_3$. By (16.27), we have

$$
\mathrm{SSE}_{y\cdot x} = e_{yy} - \frac{e_{xy}^2}{e_{xx}} = 1465.7143 - \frac{(412.7143)^2}{350.2857} = 979.4453
$$

with 17 degrees of freedom. By (16.29) and (16.33), we have

$$
\mathrm{SST}_{y\cdot x} = 1141.4709
$$

with 19 degrees of freedom. Thus by (16.30), we have

$$
\mathrm{SS}(\alpha \mid \mu, \beta) = \mathrm{SST}_{y\cdot x} - \mathrm{SSE}_{y\cdot x} = 1141.4709 - 979.4453 = 162.0256
$$

with 2 degrees of freedom. The $F$ statistic is given in (16.31) as

$$
F = \frac{\mathrm{SS}(\alpha \mid \mu, \beta)/(k - 1)}{\mathrm{SSE}_{y\cdot x}/[k(n - 1) - 1]} = \frac{162.0256/2}{979.4453/17} = 1.4061.
$$

The $p$ value is .272, and we do not reject $H_0\colon \alpha_1 = \alpha_2 = \alpha_3$.

To test $H_0\colon \beta = 0$, we use (16.36):

$$
F = \frac{e_{xy}^2/e_{xx}}{\mathrm{SSE}_{y\cdot x}/[k(n - 1) - 1]} = \frac{(412.7143)^2/350.2857}{979.4453/17} = 8.4401.
$$

The $p$ value is .0099, and we reject $H_0\colon \beta = 0$.

To test the hypothesis of equal slopes in the groups, $H_0\colon \beta_1 = \beta_2 = \beta_3$, we first estimate $\beta_1$, $\beta_2$, and $\beta_3$ using (16.45):

$$
\hat{\beta}_1 = .7903, \quad \hat{\beta}_2 = 1.9851, \quad \hat{\beta}_3 = .8579.
$$

Then by (16.46) and (16.47),

$$
\mathrm{SSE(F)}_{y\cdot x} = 880.5896, \quad \mathrm{SSE(R)}_{y\cdot x} = 979.4453.
$$

The difference $\mathrm{SSE(R)}_{y\cdot x} - \mathrm{SSE(F)}_{y\cdot x}$ is used in the numerator of the $F$ statistic in (16.49):

$$
F = \frac{(979.4453 - 880.5896)/2}{880.5896/[(3)(5)]} = .8420.
$$

The $p$ value is .450, and we do not reject $H_0\colon \beta_1 = \beta_2 = \beta_3$.
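The calculations in Example 16.3 can be reproduced with a few lines of plain Python. This is a sketch for checking the arithmetic, not code from the source; the helper name `sp` and the variable names are illustrative assumptions, and the data are those of Table 16.2.

```python
# Table 16.2: maturation weight y and initial weight x (mg), 7 fish per feeding group
y = [[49, 61, 55, 69, 51, 38, 64],
     [68, 70, 60, 53, 59, 48, 46],
     [59, 53, 54, 48, 54, 53, 37]]
x = [[35, 26, 29, 32, 23, 26, 31],
     [33, 35, 28, 29, 32, 23, 26],
     [33, 36, 26, 30, 33, 25, 23]]
k, n = len(y), len(y[0])

def sp(u, v):
    """Sum of products of deviations from the mean for two equal-length lists."""
    ubar, vbar = sum(u) / len(u), sum(v) / len(v)
    return sum((a - ubar) * (b - vbar) for a, b in zip(u, v))

# Within-group (error) sums of squares and products, (16.25), kept per group for (16.45)
e_xx_i = [sp(x[i], x[i]) for i in range(k)]
e_xy_i = [sp(x[i], y[i]) for i in range(k)]
e_yy = sum(sp(y[i], y[i]) for i in range(k))
e_xx, e_xy = sum(e_xx_i), sum(e_xy_i)

beta_hat = e_xy / e_xx                       # (16.26)
sse = e_yy - e_xy ** 2 / e_xx                # (16.27)
df_e = k * (n - 1) - 1

# Total (reduced-model) quantities, (16.33) and (16.34)
x_all = [v for g in x for v in g]
y_all = [v for g in y for v in g]
sst = sp(y_all, y_all) - sp(x_all, y_all) ** 2 / sp(x_all, x_all)

F_treat = ((sst - sse) / (k - 1)) / (sse / df_e)                       # (16.31)
F_slope = (e_xy ** 2 / e_xx) / (sse / df_e)                            # (16.36)

beta_i = [e_xy_i[i] / e_xx_i[i] for i in range(k)]                     # (16.45)
sse_full = e_yy - sum(e_xy_i[i] ** 2 / e_xx_i[i] for i in range(k))    # (16.46)
F_slopes = ((sse - sse_full) / (k - 1)) / (sse_full / (k * (n - 2)))   # (16.49)

print(round(beta_hat, 4), round(sse, 4), round(sst, 4))
print(round(F_treat, 4), round(F_slope, 4), round(F_slopes, 4))
```

The p values are not computed here because they would require an F distribution function from a statistics library.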


16.4 TWO-WAY MODEL WITH ONE COVARIATE 


In this section, we discuss the two-way (balanced) fixed-effects model with one covariate. The model was introduced in (16.6) as

$$
y_{ijk} = \mu + \alpha_i + \gamma_j + \delta_{ij} + \beta x_{ijk} + \varepsilon_{ijk}, \quad i = 1, 2, \ldots, a, \quad j = 1, 2, \ldots, c, \quad k = 1, 2, \ldots, n, \tag{16.50}
$$

where $\alpha_i$ is the effect of factor A, $\gamma_j$ is the effect of factor C, $\delta_{ij}$ is the AC interaction effect, and $x_{ijk}$ is a covariate measured on the same experimental unit as $y_{ijk}$.


16.4.1 Tests for Main Effects and Interactions 


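In order to find $\mathrm{SSE}_{y\cdot x}$, we consider the hypothesis of no overall treatment effect, that is, no A effect, no C effect, and no interaction (see a comment preceding Theorem 14.4b). By analogy to (16.28), the reduced model is

$$
y_{ijk} = \mu^* + \beta x_{ijk} + \varepsilon_{ijk}. \tag{16.51}
$$

By analogy to (16.29), SSE for the reduced model is given by

$$
\begin{aligned}
\mathrm{SSE}_{rd} &= \sum_{i=1}^{a}\sum_{j=1}^{c}\sum_{k=1}^{n} (y_{ijk} - \bar{y}_{...})^2 - \frac{\left[\sum_{ijk} (x_{ijk} - \bar{x}_{...})(y_{ijk} - \bar{y}_{...})\right]^2}{\sum_{ijk} (x_{ijk} - \bar{x}_{...})^2} \\
&= \sum_{ijk} y_{ijk}^2 - \frac{y_{...}^2}{acn} - \frac{\left[\sum_{ijk} (x_{ijk} - \bar{x}_{...})(y_{ijk} - \bar{y}_{...})\right]^2}{\sum_{ijk} (x_{ijk} - \bar{x}_{...})^2}.
\end{aligned} \tag{16.52}
$$

By analogy to (16.27), SSE for the full model in (16.50) is

$$
\begin{aligned}
\mathrm{SSE}_{y\cdot x} &= \sum_{ijk} (y_{ijk} - \bar{y}_{ij.})^2 - \frac{\left[\sum_{ijk} (x_{ijk} - \bar{x}_{ij.})(y_{ijk} - \bar{y}_{ij.})\right]^2}{\sum_{ijk} (x_{ijk} - \bar{x}_{ij.})^2} \\
&= \sum_{ijk} y_{ijk}^2 - \sum_{ij} \frac{y_{ij.}^2}{n} - \frac{\left[\sum_{ijk} (x_{ijk} - \bar{x}_{ij.})(y_{ijk} - \bar{y}_{ij.})\right]^2}{\sum_{ijk} (x_{ijk} - \bar{x}_{ij.})^2},
\end{aligned} \tag{16.53}
$$

which has $ac(n - 1) - 1$ degrees of freedom. Note that the degrees of freedom for $\mathrm{SSE}_{y\cdot x}$ have been reduced by 1 for the covariate adjustment.

Now by analogy to (16.30), the overall sum of squares for treatments is

$$
\begin{aligned}
\mathrm{SS}(\alpha, \gamma, \delta \mid \mu, \beta) &= \mathrm{SSE}_{rd} - \mathrm{SSE}_{y\cdot x} \\
&= \sum_{ij} \frac{y_{ij.}^2}{n} - \frac{y_{...}^2}{acn} - \frac{\left[\sum_{ijk} (x_{ijk} - \bar{x}_{...})(y_{ijk} - \bar{y}_{...})\right]^2}{\sum_{ijk} (x_{ijk} - \bar{x}_{...})^2} + \frac{\left[\sum_{ijk} (x_{ijk} - \bar{x}_{ij.})(y_{ijk} - \bar{y}_{ij.})\right]^2}{\sum_{ijk} (x_{ijk} - \bar{x}_{ij.})^2},
\end{aligned} \tag{16.54}
$$

which has $ac - 1$ degrees of freedom.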


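Using (14.47), (14.69), and (14.70), we can partition the term $\sum_{ij} y_{ij.}^2/n - y_{...}^2/acn$ in (16.54), representing overall treatment sum of squares, as in (14.40):

$$
\begin{aligned}
\sum_{ij} \frac{y_{ij.}^2}{n} - \frac{y_{...}^2}{acn} &= cn\sum_i (\bar{y}_{i..} - \bar{y}_{...})^2 + an\sum_j (\bar{y}_{.j.} - \bar{y}_{...})^2 + n\sum_{ij} (\bar{y}_{ij.} - \bar{y}_{i..} - \bar{y}_{.j.} + \bar{y}_{...})^2 \\
&= \mathrm{SSA}_y + \mathrm{SSC}_y + \mathrm{SSAC}_y.
\end{aligned} \tag{16.55}
$$

To conform with this notation, we define

$$
\mathrm{SSE}_y = \sum_{ijk} (y_{ijk} - \bar{y}_{ij.})^2.
$$

We have an analogous partitioning of the overall treatment sum of squares for $x$:

$$
\sum_{ij} \frac{x_{ij.}^2}{n} - \frac{x_{...}^2}{acn} = \mathrm{SSA}_x + \mathrm{SSC}_x + \mathrm{SSAC}_x, \tag{16.56}
$$

where, for example

$$
\mathrm{SSA}_x = cn\sum_{i=1}^{a} (\bar{x}_{i..} - \bar{x}_{...})^2.
$$

We also define

$$
\mathrm{SSE}_x = \sum_{ijk} (x_{ijk} - \bar{x}_{ij.})^2.
$$

The "overall treatment sum of products" $\sum_{ij} x_{ij.}y_{ij.}/n - x_{...}y_{...}/acn$ can be partitioned in a manner analogous to that in (16.55) and (16.56) (see Problem 16.8):

$$
\begin{aligned}
\sum_{ij} \frac{x_{ij.}y_{ij.}}{n} - \frac{x_{...}y_{...}}{acn} &= cn\sum_i (\bar{x}_{i..} - \bar{x}_{...})(\bar{y}_{i..} - \bar{y}_{...}) + an\sum_j (\bar{x}_{.j.} - \bar{x}_{...})(\bar{y}_{.j.} - \bar{y}_{...}) \\
&\quad + n\sum_{ij} (\bar{x}_{ij.} - \bar{x}_{i..} - \bar{x}_{.j.} + \bar{x}_{...})(\bar{y}_{ij.} - \bar{y}_{i..} - \bar{y}_{.j.} + \bar{y}_{...}) \\
&= \mathrm{SPA} + \mathrm{SPC} + \mathrm{SPAC}.
\end{aligned} \tag{16.57}
$$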


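We also define

$$
\mathrm{SPE} = \sum_{ijk} (x_{ijk} - \bar{x}_{ij.})(y_{ijk} - \bar{y}_{ij.}).
$$

We can now write $\mathrm{SSE}_{y\cdot x}$ in (16.53) in the simplified form

$$
\mathrm{SSE}_{y\cdot x} = \mathrm{SSE}_y - \frac{(\mathrm{SPE})^2}{\mathrm{SSE}_x}.
$$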
We display these sums of squares and products in Table 16.3. 

We now proceed to develop hypothesis tests for factor A, factor C, and the interaction AC. The orthogonality of the balanced design is lost when adjustments are made for the covariate [see comments following (16.34); see also Bingham and Feinberg (1982)]. We therefore obtain a "total" for each term (A, C, or AC) by adding the error SS or SP to the term SS or SP for each of $x$, $y$, and $xy$ (see the entries for A + E, C + E, and AC + E in Table 16.3). These totals are analogous to $\mathrm{SST}_{y\cdot x} = \mathrm{SS}(\alpha \mid \mu, \beta) + \mathrm{SSE}_{y\cdot x}$ in (16.32) for the one-way model. The totals are used to obtain sums of squares adjusted for the covariate in a manner analogous to that employed in the one-way model [see (16.30) or the "treatments" line in Table 16.1]. For example, the adjusted sum of squares $\mathrm{SSA}_{y\cdot x}$ for factor A is obtained as follows:

$$
\mathrm{SS(A + E)}_{y\cdot x} = \mathrm{SSA}_y + \mathrm{SSE}_y - \frac{(\mathrm{SPA} + \mathrm{SPE})^2}{\mathrm{SSA}_x + \mathrm{SSE}_x}, \tag{16.58}
$$

$$
\mathrm{SSE}_{y\cdot x} = \mathrm{SSE}_y - \frac{(\mathrm{SPE})^2}{\mathrm{SSE}_x}, \tag{16.59}
$$

$$
\mathrm{SSA}_{y\cdot x} = \mathrm{SS(A + E)}_{y\cdot x} - \mathrm{SSE}_{y\cdot x}. \tag{16.60}
$$

From inspection of (16.58), (16.59), and (16.60), we see that $\mathrm{SSA}_{y\cdot x}$ has $a - 1$ degrees of freedom. The statistic for testing $H_{01}\colon \alpha_1 = \alpha_2 = \cdots = \alpha_a$, corresponding to the

TABLE 16.3 Sums of Squares and Products for x and y in a Two-Way Model

                 SS and SP Corrected for the Mean
Source      y                   x                   xy
A           SSA_y               SSA_x               SPA
C           SSC_y               SSC_x               SPC
AC          SSAC_y              SSAC_x              SPAC
Error       SSE_y               SSE_x               SPE
A + E       SSA_y + SSE_y       SSA_x + SSE_x       SPA + SPE
C + E       SSC_y + SSE_y       SSC_x + SSE_x       SPC + SPE
AC + E      SSAC_y + SSE_y      SSAC_x + SSE_x      SPAC + SPE




TABLE 16.4 Value of Crops y and Size x of Farms in Three Iowa Counties

                                   County
                       1               2               3
Landlord-Tenant      y      x       y      x       y      x
Related           6399    160    2490     90    4489    120
                  8456    320    5349    154   10026    245
                  8453    200    5518    160    5659    160
                  4891    160   10417    234    5475    160
                  3491    120    4278    120   11382    320
Not related       6944    160    4936    160    5731    160
                  6971    160    7376    200    6787    173
                  4053    120    6216    160    5814    134
                  8767    280   10313    240    9607    239
                  6765    160    5124    120    9817    320

Source: Ostle and Mensing (1975, p. 480).


main effect of A, is then given by

$$
F = \frac{\mathrm{SSA}_{y\cdot x}/(a - 1)}{\mathrm{SSE}_{y\cdot x}/[ac(n - 1) - 1]}, \tag{16.61}
$$

which is distributed as $F[a - 1, ac(n - 1) - 1]$ if $H_{01}$ is true. Tests for factor C and the interaction AC are developed in an analogous fashion.


Example 16.4a. In each of three counties in Iowa, a sample of farms was taken from 
farms for which landlord and tenant are related and also from farms for which 
landlord and tenant are not related. Table 16.4 gives the data for y= value of 
crops produced and x = size of farm. 

We first obtain the sums of squares and products listed in Table 16.3, where factor 
A is relationship status and factor C is county. These are given in Table 16.5, where, 


TABLE 16.5 Sums of Squares and Products for x and y

                 SS and SP Corrected for the Mean
Source      y                x              xy
A           2,378,956.8      132.30         17,740.8
C           8,841,441.3      7724.47        249,752.8
AC          1,497,572.6      2040.20        41,440.3
Error       138,805,865      106,870        3,427,608.6
A + E       141,184,822      107,002.3      3,445,349.4
C + E       147,647,306      114,594.5      3,677,361.4
AC + E      140,303,437      108,910.2      3,469,048.9




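for example, $\mathrm{SSA}_y = 2{,}378{,}956.8$, $\mathrm{SSA}_y + \mathrm{SSE}_y = 141{,}184{,}822$, and $\mathrm{SPAC} + \mathrm{SPE} = 3{,}469{,}048.9$.

By (16.58), (16.59), and (16.60), we have

$$
\mathrm{SS(A + E)}_{y\cdot x} = 30{,}248{,}585, \quad \mathrm{SSE}_{y\cdot x} = 28{,}873{,}230, \quad \mathrm{SSA}_{y\cdot x} = 1{,}375{,}355.1.
$$

Then by (16.61), we have

$$
F = \frac{\mathrm{SSA}_{y\cdot x}/(a - 1)}{\mathrm{SSE}_{y\cdot x}/[ac(n - 1) - 1]} = \frac{1{,}375{,}355.1/1}{28{,}873{,}230/23} = \frac{1{,}375{,}355.1}{1{,}255{,}357.8} = 1.0956.
$$

The $p$ value is .306, and we do not reject $H_0\colon \alpha_1 = \alpha_2$.

Similarly, for factor C, we have

$$
F = \frac{766{,}750.1/2}{1{,}255{,}357.8} = .3054
$$

with $p = .740$. For the interaction AC, we obtain

$$
F = \frac{932{,}749.5/2}{1{,}255{,}357.8} = .3715
$$

with $p = .694$.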
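As a check on Example 16.4a, the entries of Table 16.5 and the three adjusted F statistics can be recomputed from the raw data in Table 16.4. The sketch below uses plain Python; the function names `partition` and `adjusted_F` and the data layout are illustrative assumptions rather than anything prescribed by the text.

```python
# Table 16.4 as (y, x) pairs; cells[i][j] = farms with status i (0 related, 1 not), county j
cells = [
    [[(6399, 160), (8456, 320), (8453, 200), (4891, 160), (3491, 120)],
     [(2490, 90), (5349, 154), (5518, 160), (10417, 234), (4278, 120)],
     [(4489, 120), (10026, 245), (5659, 160), (5475, 160), (11382, 320)]],
    [[(6944, 160), (6971, 160), (4053, 120), (8767, 280), (6765, 160)],
     [(4936, 160), (7376, 200), (6216, 160), (10313, 240), (5124, 120)],
     [(5731, 160), (6787, 173), (5814, 134), (9607, 239), (9817, 320)]],
]
a, c, n = 2, 3, 5
Y, X = 0, 1   # position of y and x inside each pair

def mean(v):
    return sum(v) / len(v)

def partition(p, q):
    """Return A, C, AC, and error sums of products for coordinates p and q."""
    def means(r):
        cell = [[mean([o[r] for o in cells[i][j]]) for j in range(c)] for i in range(a)]
        row = [mean(cell[i]) for i in range(a)]             # balanced: mean of cell means
        col = [mean([cell[i][j] for i in range(a)]) for j in range(c)]
        return cell, row, col, mean(row)
    cp, rp, kp, gp = means(p)
    cq, rq, kq, gq = means(q)
    A = c * n * sum((rp[i] - gp) * (rq[i] - gq) for i in range(a))
    C = a * n * sum((kp[j] - gp) * (kq[j] - gq) for j in range(c))
    AC = n * sum((cp[i][j] - rp[i] - kp[j] + gp) * (cq[i][j] - rq[i] - kq[j] + gq)
                 for i in range(a) for j in range(c))
    E = sum((o[p] - cp[i][j]) * (o[q] - cq[i][j])
            for i in range(a) for j in range(c) for o in cells[i][j])
    return {"A": A, "C": C, "AC": AC, "E": E}

ss_y, ss_x, sp = partition(Y, Y), partition(X, X), partition(X, Y)

df_e = a * c * (n - 1) - 1
sse_yx = ss_y["E"] - sp["E"] ** 2 / ss_x["E"]               # (16.59)
mse = sse_yx / df_e

def adjusted_F(term, df):
    total = (ss_y[term] + ss_y["E"]
             - (sp[term] + sp["E"]) ** 2 / (ss_x[term] + ss_x["E"]))   # (16.58)
    return (total - sse_yx) / df / mse                      # (16.60), (16.61)

F_A = adjusted_F("A", a - 1)
F_C = adjusted_F("C", c - 1)
F_AC = adjusted_F("AC", (a - 1) * (c - 1))
print(round(F_A, 4), round(F_C, 4), round(F_AC, 4))
```

Because the design is balanced, marginal means are computed as means of cell means; with unequal cell sizes this shortcut would no longer apply.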


16.4.2 Test for Slope 


To test the hypothesis $H_{02}\colon \beta = 0$, the sum of squares due to $\beta$ is $(\mathrm{SPE})^2/\mathrm{SSE}_x$, and the $F$ statistic is given by

$$
F = \frac{(\mathrm{SPE})^2/\mathrm{SSE}_x}{\mathrm{SSE}_{y\cdot x}/[ac(n - 1) - 1]}, \tag{16.62}
$$

which (under $H_{02}$ and also $H_{03}$ below) is distributed as $F[1, ac(n - 1) - 1]$.


Example 16.4b. To test $H_0\colon \beta = 0$ for the farms data in Table 16.4, we use SPE and $\mathrm{SSE}_x$ from Table 16.5 and $\mathrm{SSE}_{y\cdot x}$ in Example 16.4a. Then by (16.62), we obtain

$$
F = \frac{(\mathrm{SPE})^2/\mathrm{SSE}_x}{\mathrm{SSE}_{y\cdot x}/[ac(n - 1) - 1]} = \frac{(3{,}427{,}608.6)^2/106{,}870}{1{,}255{,}357.8} = 87.5708.
$$

The $p$ value is $2.63 \times 10^{-9}$, and $H_0\colon \beta = 0$ is rejected.
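The arithmetic behind (16.62) is short enough to verify directly. The snippet below is only an illustrative check that plugs in the rounded values quoted in Table 16.5 and Example 16.4a; the variable names are assumptions.

```python
spe, sse_x = 3_427_608.6, 106_870.0   # error line of Table 16.5
sse_yx, df_e = 28_873_230.0, 23       # SSE_{y.x} from Example 16.4a, df = ac(n - 1) - 1

ss_beta = spe ** 2 / sse_x            # sum of squares due to beta
mse = sse_yx / df_e                   # adjusted error mean square
F = ss_beta / mse                     # (16.62)
print(round(ss_beta, 1), round(mse, 1), round(F, 4))
```

Note that `ss_beta` is also the amount by which the covariate reduces the error sum of squares, that is, SSE_y minus SSE_{y.x}.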




16.4.3 Test for Homogeneity of Slopes 


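The test for homogeneity of slopes can be carried out separately for factor A, factor C, and the interaction AC. We describe the test for homogeneity of slopes among the levels of A. The hypothesis is

$$
H_{03}\colon \beta_1 = \beta_2 = \cdots = \beta_a;
$$

that is, the regression lines for the $a$ levels of A are parallel. The intercepts, of course, may be different. To obtain a slope estimator $\hat{\beta}_i$ for the $i$th level of A, we define $\mathrm{SSE}_x$ and SPE for the $i$th level of A:

$$
\mathrm{SSE}_{x,i} = \sum_{j=1}^{c}\sum_{k=1}^{n} (x_{ijk} - \bar{x}_{ij.})^2, \quad \mathrm{SPE}_i = \sum_{jk} (x_{ijk} - \bar{x}_{ij.})(y_{ijk} - \bar{y}_{ij.}). \tag{16.63}
$$

Then $\hat{\beta}_i$ is obtained as

$$
\hat{\beta}_i = \frac{\mathrm{SPE}_i}{\mathrm{SSE}_{x,i}},
$$

and the sum of squares due to $\hat{\beta}_i$ is $(\mathrm{SPE}_i)^2/\mathrm{SSE}_{x,i}$.

By analogy to (16.46), the sum of squares for the full model in which the $\beta_i$'s are different is given by

$$
\mathrm{SS(F)} = \mathrm{SSE}_y - \sum_{i=1}^{a} \frac{(\mathrm{SPE}_i)^2}{\mathrm{SSE}_{x,i}},
$$

and by analogy to (16.47), the sum of squares in the reduced model with a common slope is

$$
\mathrm{SS(R)} = \mathrm{SSE}_y - \frac{(\mathrm{SPE})^2}{\mathrm{SSE}_x}.
$$

Our test statistic for $H_{03}\colon \beta_1 = \beta_2 = \cdots = \beta_a$ is then similar to (16.49):

$$
\begin{aligned}
F &= \frac{[\mathrm{SS(R)} - \mathrm{SS(F)}]/(a - 1)}{\mathrm{SS(F)}/[ac(n - 1) - a]} \\
&= \frac{\left[\sum_{i=1}^{a} (\mathrm{SPE}_i)^2/\mathrm{SSE}_{x,i} - (\mathrm{SPE})^2/\mathrm{SSE}_x\right]/(a - 1)}{\left[\mathrm{SSE}_y - \sum_{i=1}^{a} (\mathrm{SPE}_i)^2/\mathrm{SSE}_{x,i}\right]/[ac(n - 1) - a]},
\end{aligned} \tag{16.64}
$$

which (under $H_{03}$) is distributed as $F[a - 1, ac(n - 1) - a]$. The tests for homogeneity of slopes for C and AC are constructed in a similar fashion.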




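Example 16.4c. To test homogeneity of slopes for factor A, we first find $\hat{\beta}_1$ and $\hat{\beta}_2$ for the two levels of A:

$$
\hat{\beta}_1 = \frac{\mathrm{SPE}_1}{\mathrm{SSE}_{x,1}} = \frac{2{,}141{,}839.8}{61{,}359.2} = 34.9066, \quad
\hat{\beta}_2 = \frac{\mathrm{SPE}_2}{\mathrm{SSE}_{x,2}} = \frac{1{,}285{,}768.8}{45{,}510.8} = 28.2519.
$$

Then

$$
\mathrm{SS(F)} = \mathrm{SSE}_y - \sum_{i=1}^{2} \frac{(\mathrm{SPE}_i)^2}{\mathrm{SSE}_{x,i}} = 27{,}716{,}088.7, \quad
\mathrm{SS(R)} = \mathrm{SSE}_y - \frac{(\mathrm{SPE})^2}{\mathrm{SSE}_x} = 28{,}873{,}230.
$$

The difference is $\mathrm{SS(R)} - \mathrm{SS(F)} = 1{,}157{,}140.94$. Then by (16.64), we obtain

$$
F = \frac{1{,}157{,}140.94/1}{27{,}716{,}088.7/22} = .9185.
$$

The $p$ value is .348, and we do not reject $H_0\colon \beta_1 = \beta_2$.

For homogeneity of slopes for factor C, we have

$$
\hat{\beta}_1 = 23.2104, \quad \hat{\beta}_2 = 50.0851, \quad \hat{\beta}_3 = 31.6693,
$$

$$
F = \frac{9{,}506{,}034.16/2}{19{,}367{,}195.5/21} = 5.1537
$$

with $p = .0151$.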
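The homogeneity-of-slopes test for factor A in Example 16.4c can be recomputed from the Table 16.4 data as follows. This is a hedged sketch in plain Python; the helper `within` and the data layout are assumptions introduced for the illustration.

```python
# Table 16.4 as (y, x) pairs; cells[i][j] = farms with status i (0 related, 1 not), county j
cells = [
    [[(6399, 160), (8456, 320), (8453, 200), (4891, 160), (3491, 120)],
     [(2490, 90), (5349, 154), (5518, 160), (10417, 234), (4278, 120)],
     [(4489, 120), (10026, 245), (5659, 160), (5475, 160), (11382, 320)]],
    [[(6944, 160), (6971, 160), (4053, 120), (8767, 280), (6765, 160)],
     [(4936, 160), (7376, 200), (6216, 160), (10313, 240), (5124, 120)],
     [(5731, 160), (6787, 173), (5814, 134), (9607, 239), (9817, 320)]],
]
a, c, n = 2, 3, 5
Y, X = 0, 1   # position of y and x inside each pair

def within(cell, p, q):
    """Within-cell sum of products of deviations for coordinates p and q."""
    mp = sum(o[p] for o in cell) / len(cell)
    mq = sum(o[q] for o in cell) / len(cell)
    return sum((o[p] - mp) * (o[q] - mq) for o in cell)

# Error sums of squares and products separately for each level of A, (16.63)
sse_x_i = [sum(within(cell, X, X) for cell in cells[i]) for i in range(a)]
spe_i = [sum(within(cell, X, Y) for cell in cells[i]) for i in range(a)]
sse_y = sum(within(cell, Y, Y) for row in cells for cell in row)

beta_i = [spe_i[i] / sse_x_i[i] for i in range(a)]                    # slope per level of A
ss_full = sse_y - sum(spe_i[i] ** 2 / sse_x_i[i] for i in range(a))   # separate slopes
ss_red = sse_y - sum(spe_i) ** 2 / sum(sse_x_i)                       # common slope
F = ((ss_red - ss_full) / (a - 1)) / (ss_full / (a * c * (n - 1) - a))   # (16.64)
print([round(b, 4) for b in beta_i], round(F, 4))
```

The same pattern applies to factor C if the sums in `sse_x_i` and `spe_i` are taken over the levels of A within each county instead.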


16.5 ONE-WAY MODEL WITH MULTIPLE COVARIATES


16.5.1 The Model 


In some cases, the researcher has more than one covariate available. Note, however, 
that each covariate decreases the error degrees of freedom by 1, and therefore the 
inclusion of too many covariates may lead to loss of power. 

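For the one-way model with $q$ covariates, we use (16.4):

$$
\begin{aligned}
y_{ij} &= \mu + \alpha_i + \beta_1 x_{ij1} + \beta_2 x_{ij2} + \cdots + \beta_q x_{ijq} + \varepsilon_{ij} \\
&= \mu + \alpha_i + \boldsymbol{\beta}'\mathbf{x}_{ij} + \varepsilon_{ij}, \quad i = 1, 2, \ldots, k, \quad j = 1, 2, \ldots, n,
\end{aligned} \tag{16.65}
$$

where $\boldsymbol{\beta}' = (\beta_1, \beta_2, \ldots, \beta_q)$ and $\mathbf{x}_{ij} = (x_{ij1}, x_{ij2}, \ldots, x_{ijq})'$. For this model, we wish to test $H_{01}\colon \alpha_1 = \alpha_2 = \cdots = \alpha_k$ and $H_{02}\colon \boldsymbol{\beta} = \mathbf{0}$. We will also extend the model to allow for a different $\boldsymbol{\beta}$ vector in each of the $k$ groups and test equality of these $\boldsymbol{\beta}$ vectors.

The model in (16.65) can be written in matrix notation as

$$
\mathbf{y} = \mathbf{Z}\boldsymbol{\alpha} + \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon},
$$

where $\mathbf{Z}$ and $\boldsymbol{\alpha}$ are given following (16.3) and $\mathbf{X}\boldsymbol{\beta}$ is as given by (16.5):

$$
\mathbf{X}\boldsymbol{\beta} = \begin{pmatrix}
x_{111} & x_{112} & \cdots & x_{11q} \\
x_{121} & x_{122} & \cdots & x_{12q} \\
\vdots & \vdots & & \vdots \\
x_{kn1} & x_{kn2} & \cdots & x_{knq}
\end{pmatrix}
\begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_q \end{pmatrix}.
$$

The vector $\mathbf{y}$ is $kn \times 1$ and the matrix $\mathbf{X}$ is $kn \times q$. We can write $\mathbf{y}$ and $\mathbf{X}\boldsymbol{\beta}$ in partitioned form corresponding to the $k$ groups:

$$
\mathbf{y} = \begin{pmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \\ \vdots \\ \mathbf{y}_k \end{pmatrix}, \quad
\mathbf{X}\boldsymbol{\beta} = \begin{pmatrix} \mathbf{X}_1 \\ \mathbf{X}_2 \\ \vdots \\ \mathbf{X}_k \end{pmatrix} \boldsymbol{\beta}, \tag{16.66}
$$

where

$$
\mathbf{y}_i = \begin{pmatrix} y_{i1} \\ y_{i2} \\ \vdots \\ y_{in} \end{pmatrix} \quad \text{and} \quad
\mathbf{X}_i = \begin{pmatrix}
x_{i11} & x_{i12} & \cdots & x_{i1q} \\
x_{i21} & x_{i22} & \cdots & x_{i2q} \\
\vdots & \vdots & & \vdots \\
x_{in1} & x_{in2} & \cdots & x_{inq}
\end{pmatrix}.
$$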


16.5.2 Estimation 


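We first obtain $\mathbf{E}_{xx}$, $\mathbf{e}_{xy}$, and $e_{yy}$ for use in $\hat{\boldsymbol{\beta}}$ and $\mathrm{SSE}_{y\cdot x}$. By (16.16), $\mathbf{E}_{xx}$ can be expressed as

$$
\mathbf{E}_{xx} = \mathbf{X}'(\mathbf{I} - \mathbf{P})\mathbf{X}.
$$

Using $\mathbf{X}$ partitioned as in (16.66) and $\mathbf{I} - \mathbf{P}$ in the form given in (16.40), $\mathbf{E}_{xx}$ becomes

$$
\mathbf{E}_{xx} = \sum_{i=1}^{k} \mathbf{X}_i'\left(\mathbf{I} - \frac{1}{n}\mathbf{J}\right)\mathbf{X}_i \tag{16.67}
$$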



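(see Problem 16.10). Similarly, using $\mathbf{y}$ partitioned as in (16.66), $\mathbf{e}_{xy}$ is given by (16.16) as

$$
\mathbf{e}_{xy} = \mathbf{X}'(\mathbf{I} - \mathbf{P})\mathbf{y} = \sum_{i=1}^{k} \mathbf{X}_i'\left(\mathbf{I} - \frac{1}{n}\mathbf{J}\right)\mathbf{y}_i. \tag{16.68}
$$

By (16.19) and (16.40), we have

$$
e_{yy} = \mathbf{y}'(\mathbf{I} - \mathbf{P})\mathbf{y} = \sum_{i=1}^{k} \mathbf{y}_i'\left(\mathbf{I} - \frac{1}{n}\mathbf{J}\right)\mathbf{y}_i. \tag{16.69}
$$

The elements of $\mathbf{E}_{xx}$, $\mathbf{e}_{xy}$, and $e_{yy}$ are extensions of the sums of squares and products found in the three expressions in (16.25).

To examine the elements of the matrix $\mathbf{E}_{xx}$, we first note that $\mathbf{I} - (1/n)\mathbf{J}$ is symmetric and idempotent and therefore $\mathbf{X}_i'[\mathbf{I} - (1/n)\mathbf{J}]\mathbf{X}_i$ in (16.67) can be written as

$$
\mathbf{X}_i'[\mathbf{I} - (1/n)\mathbf{J}]\mathbf{X}_i = \mathbf{X}_i'[\mathbf{I} - (1/n)\mathbf{J}]'[\mathbf{I} - (1/n)\mathbf{J}]\mathbf{X}_i = \mathbf{X}_{ci}'\mathbf{X}_{ci}, \tag{16.70}
$$

where $\mathbf{X}_{ci} = [\mathbf{I} - (1/n)\mathbf{J}]\mathbf{X}_i$ is the centered matrix

$$
\mathbf{X}_{ci} = \begin{pmatrix}
x_{i11} - \bar{x}_{i.1} & x_{i12} - \bar{x}_{i.2} & \cdots & x_{i1q} - \bar{x}_{i.q} \\
x_{i21} - \bar{x}_{i.1} & x_{i22} - \bar{x}_{i.2} & \cdots & x_{i2q} - \bar{x}_{i.q} \\
\vdots & \vdots & & \vdots \\
x_{in1} - \bar{x}_{i.1} & x_{in2} - \bar{x}_{i.2} & \cdots & x_{inq} - \bar{x}_{i.q}
\end{pmatrix} \tag{16.71}
$$

[see (7.33) and Problem 7.15], where $\bar{x}_{i.2}$, for example, is the mean of the second column of $\mathbf{X}_i$, that is, $\bar{x}_{i.2} = \sum_{j=1}^{n} x_{ij2}/n$. By Theorem 2.2c(i), the diagonal elements of $\mathbf{X}_{ci}'\mathbf{X}_{ci}$ are

$$
\sum_{j=1}^{n} (x_{ijr} - \bar{x}_{i.r})^2, \quad r = 1, 2, \ldots, q, \tag{16.72}
$$

and the off-diagonal elements are

$$
\sum_{j=1}^{n} (x_{ijr} - \bar{x}_{i.r})(x_{ijs} - \bar{x}_{i.s}), \quad r \neq s. \tag{16.73}
$$

By (16.67) and (16.72), the diagonal elements of $\mathbf{E}_{xx}$ are

$$
\sum_{i=1}^{k}\sum_{j=1}^{n} (x_{ijr} - \bar{x}_{i.r})^2, \quad r = 1, 2, \ldots, q, \tag{16.74}
$$

and by (16.67) and (16.73), the off-diagonal elements are

$$
\sum_{i=1}^{k}\sum_{j=1}^{n} (x_{ijr} - \bar{x}_{i.r})(x_{ijs} - \bar{x}_{i.s}), \quad r \neq s. \tag{16.75}
$$

These are analogous to $e_{xx} = \sum_{ij} (x_{ij} - \bar{x}_{i.})^2$ in (16.25).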
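The equivalence between the quadratic forms in (16.67) and (16.68) and the centered cross-products in (16.70) through (16.75) is easy to confirm numerically. The sketch below uses NumPy with simulated data; the sizes, seed, and variable names are hypothetical choices made only for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
k, n, q = 3, 5, 2                       # hypothetical numbers of groups, replicates, covariates
X_groups = [rng.normal(size=(n, q)) for _ in range(k)]
y_groups = [rng.normal(size=n) for _ in range(k)]

C = np.eye(n) - np.ones((n, n)) / n     # centering matrix I - (1/n)J

# Route 1: quadratic forms as in (16.67) and (16.68)
E_xx = sum(Xi.T @ C @ Xi for Xi in X_groups)
e_xy = sum(Xi.T @ C @ yi for Xi, yi in zip(X_groups, y_groups))

# Route 2: center each group first, then form cross-products as in (16.70)-(16.75)
Xc = [Xi - Xi.mean(axis=0) for Xi in X_groups]
yc = [yi - yi.mean() for yi in y_groups]
E_xx_c = sum(A.T @ A for A in Xc)
e_xy_c = sum(A.T @ b for A, b in zip(Xc, yc))

beta_hat = np.linalg.solve(E_xx, e_xy)             # (16.15)
e_yy = sum(b @ b for b in yc)                      # within-group sum of squares of y
sse_yx = e_yy - e_xy @ beta_hat                    # (16.18)
print(np.allclose(E_xx, E_xx_c), np.allclose(e_xy, e_xy_c), sse_yx)
```

In practice the second route is preferable, since it avoids forming the n x n centering matrix.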
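To examine the elements of the vector $\mathbf{e}_{xy}$, we note that by an argument similar to that used to obtain (16.70), $\mathbf{X}_i'[\mathbf{I} - (1/n)\mathbf{J}]\mathbf{y}_i$ in (16.68) can be written as

$$
\mathbf{X}_i'[\mathbf{I} - (1/n)\mathbf{J}]\mathbf{y}_i = \mathbf{X}_i'[\mathbf{I} - (1/n)\mathbf{J}]'[\mathbf{I} - (1/n)\mathbf{J}]\mathbf{y}_i = \mathbf{X}_{ci}'\mathbf{y}_{ci},
$$

where $\mathbf{X}_{ci}$ is as given in (16.71) and

$$
\mathbf{y}_{ci} = \begin{pmatrix} y_{i1} - \bar{y}_{i.} \\ y_{i2} - \bar{y}_{i.} \\ \vdots \\ y_{in} - \bar{y}_{i.} \end{pmatrix}
$$

with $\bar{y}_{i.} = \sum_{j=1}^{n} y_{ij}/n$. Thus the elements of $\mathbf{X}_{ci}'\mathbf{y}_{ci}$ are of the form

$$
\sum_{j=1}^{n} (x_{ijr} - \bar{x}_{i.r})(y_{ij} - \bar{y}_{i.}), \quad r = 1, 2, \ldots, q,
$$

and by (16.68), the elements of $\mathbf{e}_{xy}$ are

$$
\sum_{i=1}^{k}\sum_{j=1}^{n} (x_{ijr} - \bar{x}_{i.r})(y_{ij} - \bar{y}_{i.}), \quad r = 1, 2, \ldots, q.
$$

Similarly, $e_{yy}$ in (16.69) can be written as

$$
e_{yy} = \sum_{i=1}^{k}\sum_{j=1}^{n} (y_{ij} - \bar{y}_{i.})^2. \tag{16.76}
$$

By (16.15), we obtain

$$
\hat{\boldsymbol{\beta}} = \mathbf{E}_{xx}^{-1}\mathbf{e}_{xy},
$$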



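where $\mathbf{E}_{xx}$ is as given by (16.67) and $\mathbf{e}_{xy}$ is as given by (16.68). Likewise, by (16.18), we have

$$
\mathrm{SSE}_{y\cdot x} = e_{yy} - \mathbf{e}_{xy}'\mathbf{E}_{xx}^{-1}\mathbf{e}_{xy}, \tag{16.77}
$$

where $e_{yy}$ is as given in (16.69) or (16.76). The degrees of freedom of $\mathrm{SSE}_{y\cdot x}$ are $k(n - 1) - q$.

By (16.11) and (13.12), we obtain

$$
\hat{\boldsymbol{\alpha}} = \hat{\boldsymbol{\alpha}}_0 - (\mathbf{Z}'\mathbf{Z})^{-}\mathbf{Z}'\mathbf{X}\hat{\boldsymbol{\beta}}
= \begin{pmatrix} 0 \\ \bar{y}_{1.} \\ \bar{y}_{2.} \\ \vdots \\ \bar{y}_{k.} \end{pmatrix}
- \begin{pmatrix} 0 \\ \hat{\boldsymbol{\beta}}'\bar{\mathbf{x}}_{1.} \\ \hat{\boldsymbol{\beta}}'\bar{\mathbf{x}}_{2.} \\ \vdots \\ \hat{\boldsymbol{\beta}}'\bar{\mathbf{x}}_{k.} \end{pmatrix}
= \begin{pmatrix} 0 \\ \bar{y}_{1.} - \hat{\boldsymbol{\beta}}'\bar{\mathbf{x}}_{1.} \\ \bar{y}_{2.} - \hat{\boldsymbol{\beta}}'\bar{\mathbf{x}}_{2.} \\ \vdots \\ \bar{y}_{k.} - \hat{\boldsymbol{\beta}}'\bar{\mathbf{x}}_{k.} \end{pmatrix} \tag{16.78}
$$

$$
\phantom{\hat{\boldsymbol{\alpha}}} = \begin{pmatrix}
0 \\
\bar{y}_{1.} - (\hat{\beta}_1\bar{x}_{1.1} + \hat{\beta}_2\bar{x}_{1.2} + \cdots + \hat{\beta}_q\bar{x}_{1.q}) \\
\vdots \\
\bar{y}_{k.} - (\hat{\beta}_1\bar{x}_{k.1} + \hat{\beta}_2\bar{x}_{k.2} + \cdots + \hat{\beta}_q\bar{x}_{k.q})
\end{pmatrix}. \tag{16.79}
$$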

16.5.3 Testing Hypotheses 
16.5.3.1 Treatments 
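To test

$$
H_{01}\colon \alpha_1 = \alpha_2 = \cdots = \alpha_k
$$

adjusted for the $q$ covariates, we use the full-reduced-model approach as in Section 16.3.3.1. The full model is given by (16.65), and the reduced model (with $\alpha_1 = \alpha_2 = \cdots = \alpha_k = \alpha$) is

$$
\begin{aligned}
y_{ij} &= \mu + \alpha + \boldsymbol{\beta}'\mathbf{x}_{ij} + \varepsilon_{ij} \\
&= \mu^* + \boldsymbol{\beta}'\mathbf{x}_{ij} + \varepsilon_{ij},
\end{aligned} \tag{16.80}
$$

which is essentially the same as the multiple regression model (7.3). By (7.37) and (7.39) and by analogy with (16.33),

$$
\mathrm{SSE}_{rd} = \mathrm{SST}_{y\cdot x} = t_{yy} - \mathbf{t}_{xy}'\mathbf{T}_{xx}^{-1}\mathbf{t}_{xy}, \tag{16.81}
$$

where $t_{yy}$ is

$$
t_{yy} = \sum_{ij} (y_{ij} - \bar{y}_{..})^2,
$$

the elements of $\mathbf{t}_{xy}$ are

$$
\sum_{ij} (x_{ijr} - \bar{x}_{..r})(y_{ij} - \bar{y}_{..}), \quad r = 1, 2, \ldots, q,
$$

and the elements of $\mathbf{T}_{xx}$ are

$$
\sum_{ij} (x_{ijr} - \bar{x}_{..r})(x_{ijs} - \bar{x}_{..s}), \quad r = 1, 2, \ldots, q, \quad s = 1, 2, \ldots, q.
$$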

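Thus, by analogy with (16.30), we use (16.81) and (16.77) to obtain

$$
\begin{aligned}
\mathrm{SS}(\alpha \mid \mu, \boldsymbol{\beta}) &= \mathrm{SST}_{y\cdot x} - \mathrm{SSE}_{y\cdot x} \\
&= t_{yy} - \mathbf{t}_{xy}'\mathbf{T}_{xx}^{-1}\mathbf{t}_{xy} - e_{yy} + \mathbf{e}_{xy}'\mathbf{E}_{xx}^{-1}\mathbf{e}_{xy} \\
&= \sum_{ij} (y_{ij} - \bar{y}_{..})^2 - \sum_{ij} (y_{ij} - \bar{y}_{i.})^2 - \mathbf{t}_{xy}'\mathbf{T}_{xx}^{-1}\mathbf{t}_{xy} + \mathbf{e}_{xy}'\mathbf{E}_{xx}^{-1}\mathbf{e}_{xy} \\
&= n\sum_i (\bar{y}_{i.} - \bar{y}_{..})^2 - \mathbf{t}_{xy}'\mathbf{T}_{xx}^{-1}\mathbf{t}_{xy} + \mathbf{e}_{xy}'\mathbf{E}_{xx}^{-1}\mathbf{e}_{xy},
\end{aligned} \tag{16.82}
$$

which has $k - 1$ degrees of freedom (see Problem 16.13). We display these sums of squares and products in Table 16.6.

The test statistic for $H_{01}\colon \alpha_1 = \alpha_2 = \cdots = \alpha_k$ is

$$
F = \frac{\mathrm{SS}(\alpha \mid \mu, \boldsymbol{\beta})/(k - 1)}{\mathrm{SSE}_{y\cdot x}/[k(n - 1) - q]}, \tag{16.83}
$$

which (under $H_{01}$) is distributed as $F[k - 1, k(n - 1) - q]$.

TABLE 16.6 Analysis-of-Covariance Table for Testing $H_{01}\colon \alpha_1 = \alpha_2 = \cdots = \alpha_k$ in the One-Way Model with q Covariates

Source        SS Adjusted for the Covariate                                                            Adjusted df
Treatments    $\mathrm{SS}(\alpha \mid \mu, \boldsymbol{\beta}) = \mathrm{SST}_{y\cdot x} - \mathrm{SSE}_{y\cdot x}$      $k - 1$
Error         $\mathrm{SSE}_{y\cdot x} = e_{yy} - \mathbf{e}_{xy}'\mathbf{E}_{xx}^{-1}\mathbf{e}_{xy}$                    $k(n - 1) - q$
Total         $\mathrm{SST}_{y\cdot x} = t_{yy} - \mathbf{t}_{xy}'\mathbf{T}_{xx}^{-1}\mathbf{t}_{xy}$                    $kn - q - 1$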



16.5.3.2 Slope Vector 
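To test

$$
H_{02}\colon \boldsymbol{\beta} = \mathbf{0},
$$

the sum of squares is given by (16.22) as

$$
\mathrm{SSH} = \mathbf{e}_{xy}'\mathbf{E}_{xx}^{-1}\mathbf{e}_{xy},
$$

where $\mathbf{E}_{xx}$ is as given by (16.67) and $\mathbf{e}_{xy}$ is the same as in (16.68). The $F$ statistic is then

$$
F = \frac{\mathbf{e}_{xy}'\mathbf{E}_{xx}^{-1}\mathbf{e}_{xy}/q}{\mathrm{SSE}_{y\cdot x}/[k(n - 1) - q]}, \tag{16.84}
$$

which is distributed as $F[q, k(n - 1) - q]$ if $H_{02}\colon \boldsymbol{\beta} = \mathbf{0}$ is true.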


16.5.3.3 Homogeneity of Slope Vectors 

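The tests of $H_{01}\colon \alpha_1 = \alpha_2 = \cdots = \alpha_k$ and $H_{02}\colon \boldsymbol{\beta} = \mathbf{0}$ assume a common coefficient vector for all $k$ groups. To check this assumption, we can extend the model (16.65) to obtain a full model allowing for different slope vectors:

$$
y_{ij} = \mu + \alpha_i + \boldsymbol{\beta}_i'\mathbf{x}_{ij} + \varepsilon_{ij}, \quad i = 1, 2, \ldots, k, \quad j = 1, 2, \ldots, n. \tag{16.85}
$$

The reduced model with a single slope vector is given by (16.65). We now develop a test for the hypothesis

$$
H_{03}\colon \boldsymbol{\beta}_1 = \boldsymbol{\beta}_2 = \cdots = \boldsymbol{\beta}_k,
$$

that is, that the $k$ regression planes (for the $k$ treatments) are parallel.

By extension of (16.46) and (16.47), we have

$$
\mathrm{SSE(F)}_{y\cdot x} = e_{yy} - \sum_{i=1}^{k} \mathbf{e}_{xy,i}'\mathbf{E}_{xx,i}^{-1}\mathbf{e}_{xy,i}, \tag{16.86}
$$

$$
\mathrm{SSE(R)}_{y\cdot x} = e_{yy} - \mathbf{e}_{xy}'\mathbf{E}_{xx}^{-1}\mathbf{e}_{xy}, \tag{16.87}
$$

where

$$
\mathbf{E}_{xx,i} = \mathbf{X}_i'[\mathbf{I} - (1/n)\mathbf{J}]\mathbf{X}_i \quad \text{and} \quad \mathbf{e}_{xy,i} = \mathbf{X}_i'[\mathbf{I} - (1/n)\mathbf{J}]\mathbf{y}_i
$$

are terms in the summations in (16.67) and (16.68). The degrees of freedom for $\mathrm{SSE(F)}_{y\cdot x}$ and $\mathrm{SSE(R)}_{y\cdot x}$ are $k(n - 1) - kq = k(n - q - 1)$ and $k(n - 1) - q$, respectively. Note that $\mathrm{SSE(R)}_{y\cdot x}$ in (16.87) is the same as $\mathrm{SSE}_{y\cdot x}$ in (16.77). The estimator of $\boldsymbol{\beta}_i$ for the $i$th group is

$$
\hat{\boldsymbol{\beta}}_i = \mathbf{E}_{xx,i}^{-1}\mathbf{e}_{xy,i}. \tag{16.88}
$$

By analogy to (16.48), the sum of squares for testing $H_{03}\colon \boldsymbol{\beta}_1 = \boldsymbol{\beta}_2 = \cdots = \boldsymbol{\beta}_k$ is $\mathrm{SSE(R)}_{y\cdot x} - \mathrm{SSE(F)}_{y\cdot x} = \sum_{i=1}^{k} \mathbf{e}_{xy,i}'\mathbf{E}_{xx,i}^{-1}\mathbf{e}_{xy,i} - \mathbf{e}_{xy}'\mathbf{E}_{xx}^{-1}\mathbf{e}_{xy}$, which has $k(n - 1) - q - [k(n - 1) - kq] = q(k - 1)$ degrees of freedom. The test statistic for $H_{03}\colon \boldsymbol{\beta}_1 = \boldsymbol{\beta}_2 = \cdots = \boldsymbol{\beta}_k$ is

$$
F = \frac{[\mathrm{SSE(R)}_{y\cdot x} - \mathrm{SSE(F)}_{y\cdot x}]/[q(k - 1)]}{\mathrm{SSE(F)}_{y\cdot x}/[k(n - q - 1)]}, \tag{16.89}
$$

which is distributed as $F[q(k - 1), k(n - q - 1)]$ if $H_{03}$ is true. Note that if $n$ is not large, $n - q - 1$ may be small, and the test will have low power.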


Example 16.5. In Table 16.7, we have instructor rating $y$ and two course ratings $x_1$ and $x_2$ for five instructors in each of three courses (Morrison 1983, p. 470). We first find $\hat{\boldsymbol{\beta}}$ and $\mathrm{SSE}_{y\cdot x}$. Using (16.67), (16.68), and (16.69), we obtain


1.0619 0.6791 1.0229 
Ba = eS ey eo = Gea a 


Then by (16.15), we obtain 


By (16.77) and (16.81), we have 


$$
\mathrm{SSE}_{y\cdot x} = .5585, \quad \mathrm{SST}_{y\cdot x} = .7840.
$$


TABLE 16.7 Instructor Rating y and Two Course Ratings x1 and x2 in Three Courses

                              Course
         1                      2                      3
   y     x1     x2        y     x1     x2        y     x1     x2
 2.14   2.71   2.50     2.77   2.29   2.45     1.11   1.74   1.82
 2.50   2.66   2.69     1.37   1.78   1.83     1.74   1.40   2.23
 1.40   2.80   2.00     1.52   2.18   2.24     1.15   1.80   1.82
 1.90   2.38   2.30     1.81   2.14   2.11     1.66   2.17   2.35




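Then by (16.82), we see that

$$
\mathrm{SS}(\alpha \mid \mu, \boldsymbol{\beta}) = \mathrm{SST}_{y\cdot x} - \mathrm{SSE}_{y\cdot x} = .2254.
$$

The $F$ statistic for testing $H_0\colon \alpha_1 = \alpha_2 = \alpha_3$ is given by (16.83) as

$$
F = \frac{\mathrm{SS}(\alpha \mid \mu, \boldsymbol{\beta})/(k - 1)}{\mathrm{SSE}_{y\cdot x}/[k(n - 1) - q]} = \frac{.2254/2}{.5585/10} = 2.0182, \quad p = .184.
$$

To test $H_{02}\colon \boldsymbol{\beta} = \mathbf{0}$, we use (16.84) to obtain

$$
F = \frac{\mathbf{e}_{xy}'\mathbf{E}_{xx}^{-1}\mathbf{e}_{xy}/q}{\mathrm{SSE}_{y\cdot x}/[k(n - 1) - q]} = 27.2591, \quad p = 8.95 \times 10^{-5}.
$$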
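Both F statistics above have 2 numerator degrees of freedom, a case in which the upper-tail probability of the F distribution has a simple closed form, $P(F_{2,\nu} > f) = (1 + 2f/\nu)^{-\nu/2}$. The sketch below uses that identity to check the quoted p values; the function name is an assumption, and the formula does not apply to other numerator degrees of freedom.

```python
def f_sf_num_df2(F, df2):
    """Upper-tail probability P(F_{2, df2} > F); valid only for 2 numerator df."""
    return (1.0 + 2.0 * F / df2) ** (-df2 / 2.0)

p_treat = f_sf_num_df2(2.0182, 10)    # treatments test, F[2, 10]
p_slope = f_sf_num_df2(27.2591, 10)   # slope-vector test, F[2, 10]
print(round(p_treat, 3), p_slope)
```

For general degrees of freedom, a library routine for the F distribution would be needed instead.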


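Before testing homogeneity of slope vectors, $H_0\colon \boldsymbol{\beta}_1 = \boldsymbol{\beta}_2 = \boldsymbol{\beta}_3$, we first obtain estimates of $\boldsymbol{\beta}_1$, $\boldsymbol{\beta}_2$, and $\boldsymbol{\beta}_3$ using (16.88):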


k os 4236 .1900\ | /.2786 —(.0467 
B, = EY 1 ey,1 = = > 
, 1900 .4039 6254 1.5703 
a= es een 7 ee 
2 \.2758 4161 6649) \ 1.7159)’ 
ie be ia) 7 ee 
3 \..2133 4163 6492) \ 1.5993] 


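Then by (16.86) and (16.87), we obtain

$$
\mathrm{SSE(F)}_{y\cdot x} = e_{yy} - \sum_{i=1}^{3} \mathbf{e}_{xy,i}'\mathbf{E}_{xx,i}^{-1}\mathbf{e}_{xy,i} = .55725,
$$

$$
\mathrm{SSE(R)}_{y\cdot x} = e_{yy} - \mathbf{e}_{xy}'\mathbf{E}_{xx}^{-1}\mathbf{e}_{xy} = .55855.
$$

The $F$ statistic for testing $H_0\colon \boldsymbol{\beta}_1 = \boldsymbol{\beta}_2 = \boldsymbol{\beta}_3$ is then given by (16.89) as

$$
F = \frac{[\mathrm{SSE(R)}_{y\cdot x} - \mathrm{SSE(F)}_{y\cdot x}]/[q(k - 1)]}{\mathrm{SSE(F)}_{y\cdot x}/[k(n - q - 1)]} = \frac{.0012993/4}{.55725/6} = .003498.
$$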


16.6 ANALYSIS OF COVARIANCE WITH UNBALANCED MODELS 


The results in previous sections are for balanced ANOVA models to which covariates 
have been added. The case in which the ANOVA model is itself unbalanced before 
the addition of a covariate was treated by Hendrix et al. (1982), who also discussed 
heterogeneity of slopes. The following approach, based on the cell means model of 
Chapter 15, was suggested by Bryce (1998). 

For an analysis-of-covariance model with a single covariate and a common slope
$\beta$, we extend the cell means model (15.3) or (15.18) as

$$\mathbf{y} = (\mathbf{W}, \mathbf{x})\begin{pmatrix} \boldsymbol{\mu} \\ \beta \end{pmatrix} + \boldsymbol{\varepsilon} = \mathbf{W}\boldsymbol{\mu} + \beta\mathbf{x} + \boldsymbol{\varepsilon}. \tag{16.90}$$

This model allows for imbalance in the $n_{ij}$'s as well as the inherent imbalance in
analysis of covariance models [see Bingham and Feinberg (1982) and a comment
following (16.34)]. The vector $\boldsymbol{\mu}$ contains the means for a one-way model as in
(15.2), a two-way model as in (15.17), or some other model. Hypotheses about
main effects, interactions, the covariate, or other effects can be tested by using
contrasts on $\begin{pmatrix} \boldsymbol{\mu} \\ \beta \end{pmatrix}$ as in Section 15.3.


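The hypothesis $H_{02}: \beta = 0$ can be expressed in the form
$H_{02}: (0, \ldots, 0, 1)\begin{pmatrix} \boldsymbol{\mu} \\ \beta \end{pmatrix} = 0$. To test $H_{02}$, we use a statistic analogous to (15.29) or
(15.32). To test homogeneity of slopes, $H_{03}: \beta_1 = \beta_2 = \cdots = \beta_k$ for a one-way
model (or $H_{03}: \beta_1 = \beta_2 = \cdots = \beta_a$ for the slopes of the $a$ levels of factor A in a
two-way model, and so on), we expand the model (16.90) to include the $\beta_i$'s:

$$\mathbf{y} = (\mathbf{W}, \mathbf{W}_x)\begin{pmatrix} \boldsymbol{\mu} \\ \boldsymbol{\beta} \end{pmatrix} + \boldsymbol{\varepsilon} = \mathbf{W}\boldsymbol{\mu} + \mathbf{W}_x\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \tag{16.91}$$

where $\boldsymbol{\beta} = (\beta_1, \beta_2, \ldots, \beta_k)'$ and $\mathbf{W}_x$ has a single value of $x_{ij}$ in each row and all other
elements are 0s. (The $x_{ij}$ in $\mathbf{W}_x$ is in the same position as the corresponding 1 in $\mathbf{W}$.)

Then $H_{03}: \beta_1 = \beta_2 = \cdots = \beta_k$ can be expressed as $H_{03}: (\mathbf{O}, \mathbf{C})\begin{pmatrix} \boldsymbol{\mu} \\ \boldsymbol{\beta} \end{pmatrix} = \mathbf{C}\boldsymbol{\beta} = \mathbf{0}$,
where $\mathbf{C}$ is a $(k-1) \times k$ matrix of rank $k-1$ such that $\mathbf{C}\mathbf{j} = \mathbf{0}$. We can test
$H_{03}: \mathbf{C}\boldsymbol{\beta} = \mathbf{0}$ using a statistic analogous to (15.33).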
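The construction of $\mathbf{W}$ and $\mathbf{W}_x$ is easy to see in code. The following Python sketch uses a small made-up unbalanced one-way data set (the numbers are hypothetical, not from the text) and tests homogeneity of slopes by comparing the residual sums of squares of models (16.90) and (16.91). The full-and-reduced-model F ratio is used here as a stand-in for the general linear hypothesis statistic that the text refers to; the two are equivalent for a hypothesis of this form.

```python
import numpy as np


def sse(design, y):
    """Residual sum of squares from a least-squares fit of y on design."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return float(resid @ resid)


# Hypothetical unbalanced one-way layout with group sizes 3, 2, and 4
groups = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])
x = np.array([1.0, 2.0, 4.0, 1.5, 3.0, 0.5, 2.0, 3.5, 5.0])
y = np.array([2.1, 3.2, 4.8, 1.9, 3.6, 1.0, 2.4, 4.1, 5.3])
k, N = 3, len(y)

W = np.eye(k)[groups]        # N x k indicator matrix of group membership
W_x = W * x[:, None]         # x_ij placed where the 1 sits in W

sse_reduced = sse(np.column_stack([W, x]), y)     # common slope, model (16.90)
sse_full = sse(np.column_stack([W, W_x]), y)      # separate slopes, model (16.91)

# F statistic for H03 with k - 1 and N - 2k degrees of freedom
F = ((sse_reduced - sse_full) / (k - 1)) / (sse_full / (N - 2 * k))
print(F)
```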

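Constraints on the $\mu$'s and the $\beta$'s can be introduced by inserting nonsingular
matrices $\mathbf{A}$ and $\mathbf{A}_x$ into (16.91):

$$\mathbf{y} = \mathbf{W}\mathbf{A}^{-1}\mathbf{A}\boldsymbol{\mu} + \mathbf{W}_x\mathbf{A}_x^{-1}\mathbf{A}_x\boldsymbol{\beta} + \boldsymbol{\varepsilon}. \tag{16.92}$$

The matrix $\mathbf{A}$ has the form illustrated in (15.37) for constraints on the $\mu$'s. The matrix
$\mathbf{A}_x$ provides constraints on the $\beta$'s. For example, if

$$\mathbf{A}_x = \begin{pmatrix} \mathbf{j}' \\ \mathbf{C} \end{pmatrix},$$

where $\mathbf{C}$ is a $(k-1) \times k$ matrix of rank $k-1$ such that $\mathbf{C}\mathbf{j} = \mathbf{0}$ as above, then the
model (16.92) has a common slope. In some cases, the matrices $\mathbf{A}$ and $\mathbf{A}_x$ would
be the same.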
PROBLEMS 
16.1 Show that if the columns of X are linearly independent of those of Z, then 
X’(I — P)X is nonsingular, as noted preceding (16.14). 
16.2 (a) Show that SSE,., = e,, — eB, xy as in (16.18). 
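(b) Show that $e_{yy} = \mathbf{y}'(\mathbf{I} - \mathbf{P})\mathbf{y}$ as in (16.19).
16.3 Show that for $H_0: \boldsymbol{\beta} = \mathbf{0}$, we have $\mathrm{SSH} = \hat{\boldsymbol{\beta}}'\mathbf{X}'(\mathbf{I} - \mathbf{P})\mathbf{X}\hat{\boldsymbol{\beta}}$ as in (16.21).
16.4 Show that $\hat{\boldsymbol{\alpha}} = (0,\ \bar{y}_{1.} - \hat{\beta}\bar{x}_{1.},\ \ldots,\ \bar{y}_{k.} - \hat{\beta}\bar{x}_{k.})'$ as in (16.24).
16.5 Show that $e_{xx} = \sum_{ij}(x_{ij} - \bar{x}_{i.})^2$, $e_{xy} = \sum_{ij}(x_{ij} - \bar{x}_{i.})(y_{ij} - \bar{y}_{i.})$, and
$e_{yy} = \sum_{ij}(y_{ij} - \bar{y}_{i.})^2$, as in (16.25).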
16.6 (a) Show that E,, has the form shown in (16.41). 
(b) Show that e,, has the form shown in (16.43). 
16.7 Show that the sums of products in (16.52) and (16.53) can be written as 
ik (Xie — Xi Vie — Yi) = ik XiikVijk — 1 i Xi, and ik (xix — X...) 
(vik — Y= Doign XV — NXT. 
16.8 Show that the “treatment sum of products” ey ai. /n—x..y../acn can be 
partitioned into the three sums of products in (16.57). 
16.9 (a) Express the sums of squares and test statistic for factor C in a form ana- 
logous to those for factor A in (16.58), (16.60), and (16.61). 
(b) Express the sums of squares and test statistic for the interaction AC in a 
form analogous to those for factor A in (16.58), (16.60), and (16.61). 
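16.10 (a) Show that $\mathbf{E}_{xx} = \sum_{i=1}^{k} \mathbf{X}_i'[\mathbf{I} - (1/n)\mathbf{J}]\mathbf{X}_i$ as in (16.67).
(b) Show that $\mathbf{e}_{xy} = \sum_{i=1}^{k} \mathbf{X}_i'[\mathbf{I} - (1/n)\mathbf{J}]\mathbf{y}_i$ as in (16.68).
(c) Show that $e_{yy} = \sum_{i=1}^{k} \mathbf{y}_i'[\mathbf{I} - (1/n)\mathbf{J}]\mathbf{y}_i$ as in (16.69).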
16.11 Show that the elements of X',.X;. are given by (16.72) and (16.73). 
16.12 Show that @ has the form given in (16.78). 
16.13 Show that )>,, (yy — ¥.)° — yO — 9)" =2 DOG, — ¥_)” as in (16.82). 
16.14 In Table 16.8 we have the weight gain $y$ and initial weight $x$ of pigs under
four diets (treatments).
(a) Estimate $\beta$.
(b) Test $H_0: \alpha_1 = \alpha_2 = \alpha_3 = \alpha_4$ using F in (16.31).
(c) Test $H_0: \beta = 0$ using F in (16.36).
(d) Estimate $\beta_1$, $\beta_2$, $\beta_3$, and $\beta_4$ and test homogeneity of slopes
$H_0: \beta_1 = \beta_2 = \beta_3 = \beta_4$ using F in (16.49).


TABLE 16.8 Gain in Weight y and Initial Weight x
of Pigs

                 Treatment
     1          2          3          4
   y    x     y    x     y    x     y    x
  165  30    180  24    156  34    201  41
  170  27    169  31    189  32    173  32
  130  20    171  20    138  35    200  30
  156  21    161  26    190  35    193  35
  167  33    180  20    160  30    142  28
  151  29    170  25    172  29    189  36

Source: Ostle and Malone (1988, p. 445).


16.15 In a study to investigate the effect of income and geographic area of residence
on daily calories consumed, three people were chosen at random in each of
the 18 income-zone combinations. Their daily caloric intake $y$ and age $x$ are
recorded in Table 16.9.

(a) Obtain the sums of squares and products listed in Table 16.3, where zone
is factor A and income group is factor C.

(b) Calculate $\mathrm{SS(A + E)}_{y\cdot x}$, $\mathrm{SSE}_{y\cdot x}$, and $\mathrm{SSA}_{y\cdot x}$ using (16.58), (16.59), and
(16.60). For factor A calculate F by (16.61) for $H_0: \alpha_1 = \alpha_2 = \alpha_3$.
Similarly, obtain the F statistic for factor C and the interaction.

(c) Using SPE, $\mathrm{SSE}_x$, and $\mathrm{SSE}_{y\cdot x}$ from parts (a) and (b), calculate the F
statistic to test $H_0: \beta = 0$.

(d) Calculate the separate slopes for the three levels of factor A, find SS(F)
and SS(R), and test for homogeneity of slopes. Repeat for factor C.


16.16 In a study to investigate differences in ability to distinguish aurally between
environmental sounds, 10 male subjects and 10 female subjects were
assigned randomly to each of two levels of treatment (experimental and
control). The variables were $x$ = pretest score and $y$ = posttest score on
auditory discrimination. The data are given in Table 16.10.

We use the posttest score $y$ as the dependent variable and the pretest score $x$
as the covariate. This gives the same result as using the gain score (post - pre)
as the dependent variable and the pretest as the covariate (Hendrix et al.
1978).

(a) Obtain the sums of squares and products listed in Table 16.3, where
treatment is factor A and gender is factor C.




TABLE 16.9 Caloric Intake y and Age x for People
Classified by Geographic Zone and Income Group

Income       Zone 1        Zone 2        Zone 3
Group       y     x       y     x       y     x
  1       1911   46     1318   80     1127   74
          1560   66     1541   67     1509   71
          2639   38     1350   73     1756   60
  2       1034   50     1559   58     1054   83
          2096   33     1260   74     2238   47
          1356   44     1772   44     1599   71
  3       2130   35     2027   32     1479   56
          1878   45     1414   51     1837   40
          1152   59     1526   34     1437   66
  4       1297   68     1938   33     2136   51
          2093   43     1551   40     1765   56
          2035   59     1450   39     1056   70
  5       2189   33     1183   54     1156   47
          2078   36     1967   36     2660   43
          1905   38     1452   53     1474   50
  6       1156   57     2599   35     1015   63
          1809   52     2355   64     2555   34
          1997   44     1932   79     1436   54

Source: Ostle and Mensing (1975, p. 482).


TABLE 16.10 Pretest Score x and Posttest Score y on
Auditory Discrimination

         Male                    Female
  Exp.(a)    Control       Exp.       Control
  51  69     37  55       39  65      35  53
  62  69     64  66       32  66      62  65
  58  66     69  69       62  70      67  69
  52  61     70  69       64  68      51  68
  59  63     39  57       66  68      42  61

(a) Experimental.
Source: Hendrix (1967, pp. 154-157).




TABLE 16.11 Initial Age x1, Initial Weight x2, and Rate of Gain y of 40 Pigs


Treatment 1 Treatment 2 Treatment 3 Treatment 4


x2 y x1 x2 y x1 x2 y x1 x2 y


57 1.34 75 52 1.29 71 47 1.39 71 41 1.31 
45 1.55 63 43 1.43 66 52 1.39 63 40 1.27 
41 1.57 62 50 1.29 67 40 1.56 62 45 1.22 
40 1.26 67 40 1.26 67 40 1.36 67 39 1.36 


Source: Snedecor and Cochran (1967, p. 440). 


(b) Calculate $\mathrm{SS(A + E)}_{y\cdot x}$, $\mathrm{SSE}_{y\cdot x}$, and $\mathrm{SSA}_{y\cdot x}$ using (16.58), (16.59), and
(16.60). For factor A calculate F by (16.61) for $H_0: \alpha_1 = \alpha_2$.
Similarly, obtain the F statistic for factor C and the interaction.

(c) Using SPE, $\mathrm{SSE}_x$, and $\mathrm{SSE}_{y\cdot x}$ from parts (a) and (b), calculate the F
statistic to test $H_0: \beta = 0$.

(d) Calculate the separate slopes for the two levels of factor A, find SS(F)
and SS(R), and test for homogeneity of slopes. Repeat for factor C.


16.17 In an experiment comparing four diets (treatments), the weight gain $y$
(pounds per day) of pigs was recorded along with two covariates, initial
age $x_1$ (days) and initial weight $x_2$ (pounds). The data are presented in
Table 16.11.

(a) Using (16.67), (16.68), and (16.69), find $\mathbf{E}_{xx}$, $\mathbf{e}_{xy}$, and $e_{yy}$. Find $\hat{\boldsymbol{\beta}}$.

(b) Using (16.77), (16.81), and (16.82), find $\mathrm{SSE}_{y\cdot x}$, $\mathrm{SST}_{y\cdot x}$, and $\mathrm{SS}(\boldsymbol{\alpha}\,|\,\mu, \boldsymbol{\beta})$.
Then test $H_0: \alpha_1 = \alpha_2 = \alpha_3 = \alpha_4$, adjusted for the covariates, using the
F statistic in (16.83).

(c) Test $H_0: \boldsymbol{\beta} = \mathbf{0}$ using (16.84).

(d) Find $\hat{\boldsymbol{\beta}}_1$, $\hat{\boldsymbol{\beta}}_2$, $\hat{\boldsymbol{\beta}}_3$, and $\hat{\boldsymbol{\beta}}_4$ using (16.88). Find $\mathrm{SSE(F)}_{y\cdot x}$ and $\mathrm{SSE(R)}_{y\cdot x}$
using (16.86) and (16.87). Test $H_0: \boldsymbol{\beta}_1 = \boldsymbol{\beta}_2 = \boldsymbol{\beta}_3 = \boldsymbol{\beta}_4$ using (16.89).


17 Linear Mixed Models 


17.1 INTRODUCTION 


In Section 7.8 we briefly considered linear models in which the y variables are
correlated or have nonconstant variances (or both). We used the model

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \qquad E(\boldsymbol{\varepsilon}) = \mathbf{0}, \qquad \mathrm{cov}(\boldsymbol{\varepsilon}) = \boldsymbol{\Sigma} = \sigma^2\mathbf{V}, \tag{17.1}$$

where $\mathbf{V}$ is a known positive definite matrix, and developed estimators for $\boldsymbol{\beta}$ in (7.63)
and $\sigma^2$ in (7.65). Hypothesis tests and confidence intervals were not given, but they
could have been developed by adding the assumption of normality and modifying the
approaches of Chapter 8 (see Problems 17.1 and 17.2).

Correlated data are commonly encountered in practice (Brown and Prescott 2006,
pp. 1-3; Fitzmaurice et al. 2004, p. xvi; McLean et al. 1991). We can use the methods
of Section 7.8 as a starting point in approaching such data, but those methods are
actually of limited practical use because we rarely, if ever, know $\mathbf{V}$. On the other
hand, the structure of $\mathbf{V}$ is often known and in many cases can be specified up to
relatively few unknown parameters. This chapter is an introduction to linear models for
correlated y variables where the structure of $\boldsymbol{\Sigma} = \sigma^2\mathbf{V}$ can be specified.


17.2 THE LINEAR MIXED MODEL 


Nonindependence of observations may result from serial correlation or clustering of 
the observations (Diggle et al. 2002). Serial correlation, which will not be discussed 
further in this chapter, is present when a time- (or space-) varying stochastic process is 
operating on the units and the units are repeatedly measured over time (or space). 
Cluster correlation is present when the observations are grouped in various ways. 
The groupings might be due, for example, to repeated random sampling of subgroups 
or repeated measuring of the same units. Examples are given in Section 17.3. In many 
cases the covariance structure of cluster-correlated data can be specified using an
extension of the standard linear model (7.4) resembling the partitioned linear model
(7.78). If $\mathbf{y}$ is an $n \times 1$ vector of responses, the model is

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}_1\mathbf{a}_1 + \mathbf{Z}_2\mathbf{a}_2 + \cdots + \mathbf{Z}_m\mathbf{a}_m + \boldsymbol{\varepsilon}, \tag{17.2}$$

where $E(\boldsymbol{\varepsilon}) = \mathbf{0}$ and $\mathrm{cov}(\boldsymbol{\varepsilon}) = \sigma^2\mathbf{I}_n$, as usual. Here $\mathbf{X}$ is an $n \times p$ known, possibly
non-full-rank matrix of fixed predictors as in Chapters 7, 8, 11, 12, and 16. It could be
used to specify a multiple regression model, analysis-of-variance model, or analysis
of covariance model. It could be as simple as a vector of 1s. As usual, $\boldsymbol{\beta}$ is a $p \times 1$
vector of unknown fixed parameters.

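The $\mathbf{Z}_i$'s are known $n \times r_i$ full-rank matrices of fixed predictors, usually used to
specify membership in the various clusters or subgroups. The major innovation in
this model is that the $\mathbf{a}_i$'s are $r_i \times 1$ vectors of unknown random quantities similar
to $\boldsymbol{\varepsilon}$. We assume that $E(\mathbf{a}_i) = \mathbf{0}$ and $\mathrm{cov}(\mathbf{a}_i) = \sigma_i^2\mathbf{I}_{r_i}$ for $i = 1, \ldots, m$. For simplicity
we further assume that $\mathrm{cov}(\mathbf{a}_i, \mathbf{a}_j) = \mathbf{O}$ for $i \neq j$, where $\mathbf{O}$ is $r_i \times r_j$, and that
$\mathrm{cov}(\mathbf{a}_i, \boldsymbol{\varepsilon}) = \mathbf{O}$ for all $i$, where $\mathbf{O}$ is $r_i \times n$. These assumptions are often reasonable
(McCulloch and Searle 2001, pp. 159-160).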

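Note that this model is very different from the random-x model of Chapter 10.
In Chapter 10 the predictors in $\mathbf{X}$ were random while the parameters in $\boldsymbol{\beta}$ were
fixed. Here the opposite scenario applies; predictors in each $\mathbf{Z}_i$ are fixed while the
elements of $\mathbf{a}_i$ are random. On the other hand, this model has much in common
with the Bayesian linear model of Chapter 11. In fact, if the normality assumption
is added, the model can be stated in a form reminiscent of the Bayesian linear
model as

$$\mathbf{y}\,|\,\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_m \ \text{is} \ N_n(\mathbf{X}\boldsymbol{\beta} + \mathbf{Z}_1\mathbf{a}_1 + \mathbf{Z}_2\mathbf{a}_2 + \cdots + \mathbf{Z}_m\mathbf{a}_m,\ \sigma^2\mathbf{I}_n),$$

$$\mathbf{a}_i \ \text{is} \ N_{r_i}(\mathbf{0},\ \sigma_i^2\mathbf{I}_{r_i}) \quad \text{for } i = 1, \ldots, m.$$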


The label linear mixed model seems appropriate to describe (17.2) because the
model involves a mixture of linear functions of fixed parameters in $\boldsymbol{\beta}$ and linear
functions of random quantities in the $\mathbf{a}_i$'s. The special case in which $\mathbf{X} = \mathbf{j}$ (so that there is
only one fixed parameter) is sometimes referred to as a random model. The $\sigma_i^2$'s
(including $\sigma^2$) are referred to as variance components.

We now investigate $E(\mathbf{y})$ and $\mathrm{cov}(\mathbf{y}) = \boldsymbol{\Sigma}$ under the model in (17.2).


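Theorem 17.2. Consider the model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \sum_{i=1}^{m}\mathbf{Z}_i\mathbf{a}_i + \boldsymbol{\varepsilon}$, where $\mathbf{X}$ is a known
$n \times p$ matrix, the $\mathbf{Z}_i$'s are known $n \times r_i$ full-rank matrices, $\boldsymbol{\beta}$ is a $p \times 1$ vector of
unknown parameters, $\boldsymbol{\varepsilon}$ is an $n \times 1$ unknown random vector such that $E(\boldsymbol{\varepsilon}) = \mathbf{0}$
and $\mathrm{cov}(\boldsymbol{\varepsilon}) = \sigma^2\mathbf{I}_n$, and the $\mathbf{a}_i$'s are $r_i \times 1$ unknown random vectors such that
$E(\mathbf{a}_i) = \mathbf{0}$ and $\mathrm{cov}(\mathbf{a}_i) = \sigma_i^2\mathbf{I}_{r_i}$. Furthermore, $\mathrm{cov}(\mathbf{a}_i, \mathbf{a}_j) = \mathbf{O}$ for $i \neq j$, where $\mathbf{O}$ is
$r_i \times r_j$, and $\mathrm{cov}(\mathbf{a}_i, \boldsymbol{\varepsilon}) = \mathbf{O}$ for all $i$, where $\mathbf{O}$ is $r_i \times n$. Then $E(\mathbf{y}) = \mathbf{X}\boldsymbol{\beta}$ and
$\mathrm{cov}(\mathbf{y}) = \boldsymbol{\Sigma} = \sum_{i=1}^{m}\sigma_i^2\mathbf{Z}_i\mathbf{Z}_i' + \sigma^2\mathbf{I}_n$.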




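PROOF

$$\begin{aligned}
E(\mathbf{y}) &= E\Big(\mathbf{X}\boldsymbol{\beta} + \sum_{i=1}^{m}\mathbf{Z}_i\mathbf{a}_i + \boldsymbol{\varepsilon}\Big) \\
&= \mathbf{X}\boldsymbol{\beta} + E\Big(\sum_{i=1}^{m}\mathbf{Z}_i\mathbf{a}_i + \boldsymbol{\varepsilon}\Big) \\
&= \mathbf{X}\boldsymbol{\beta} + \sum_{i=1}^{m}\mathbf{Z}_iE(\mathbf{a}_i) + E(\boldsymbol{\varepsilon}) \qquad \text{[by (3.21) and (3.38)]} \\
&= \mathbf{X}\boldsymbol{\beta}.
\end{aligned}$$

$$\begin{aligned}
\mathrm{cov}(\mathbf{y}) &= \mathrm{cov}\Big(\mathbf{X}\boldsymbol{\beta} + \sum_{i=1}^{m}\mathbf{Z}_i\mathbf{a}_i + \boldsymbol{\varepsilon}\Big) \\
&= \mathrm{cov}\Big(\sum_{i=1}^{m}\mathbf{Z}_i\mathbf{a}_i + \boldsymbol{\varepsilon}\Big) \\
&= \sum_{i=1}^{m}\mathrm{cov}(\mathbf{Z}_i\mathbf{a}_i) + \mathrm{cov}(\boldsymbol{\varepsilon}) + \sum_{i \neq j}\mathrm{cov}(\mathbf{Z}_i\mathbf{a}_i, \mathbf{Z}_j\mathbf{a}_j) \\
&\qquad + \sum_{i=1}^{m}\mathrm{cov}(\mathbf{Z}_i\mathbf{a}_i, \boldsymbol{\varepsilon}) + \sum_{i=1}^{m}\mathrm{cov}(\boldsymbol{\varepsilon}, \mathbf{Z}_i\mathbf{a}_i) \qquad \text{[see Problem 3.19]} \\
&= \sum_{i=1}^{m}\mathbf{Z}_i\,\mathrm{cov}(\mathbf{a}_i)\,\mathbf{Z}_i' + \mathrm{cov}(\boldsymbol{\varepsilon}) + \sum_{i \neq j}\mathbf{Z}_i\,\mathrm{cov}(\mathbf{a}_i, \mathbf{a}_j)\,\mathbf{Z}_j' \\
&\qquad + \sum_{i=1}^{m}\mathbf{Z}_i\,\mathrm{cov}(\mathbf{a}_i, \boldsymbol{\varepsilon}) + \sum_{i=1}^{m}\mathrm{cov}(\boldsymbol{\varepsilon}, \mathbf{a}_i)\,\mathbf{Z}_i' \qquad \text{[by Theorem 3.6d and Theorem 3.6e]} \\
&= \sum_{i=1}^{m}\sigma_i^2\mathbf{Z}_i\mathbf{Z}_i' + \sigma^2\mathbf{I}_n.
\end{aligned}$$

Note that the z's only enter into the covariance structure while the x's only determine
the mean of y.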
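The covariance formula of Theorem 17.2 translates directly into code. The following Python sketch builds $\boldsymbol{\Sigma}$ from a list of $\mathbf{Z}_i$ matrices and variance components; the function name `mixed_model_cov` and the numerical values are illustrative assumptions, not part of the text. With a single random effect that indicates three clusters of size four, the result is block diagonal with compound-symmetric blocks.

```python
import numpy as np


def mixed_model_cov(Z_list, sigma2_list, sigma2_e):
    """Return sum_i sigma_i^2 Z_i Z_i' + sigma^2 I_n."""
    n = Z_list[0].shape[0]
    Sigma = sigma2_e * np.eye(n)
    for Z, s2 in zip(Z_list, sigma2_list):
        Sigma = Sigma + s2 * (Z @ Z.T)
    return Sigma


# One random effect: three clusters, four observations per cluster
Z1 = np.kron(np.eye(3), np.ones((4, 1)))      # 12 x 3 membership matrix
sigma2_1, sigma2 = 2.0, 0.5                   # hypothetical variance components

Sigma = mixed_model_cov([Z1], [sigma2_1], sigma2)

# Each 4 x 4 diagonal block equals sigma_1^2 J_4 + sigma^2 I_4
block = sigma2_1 * np.ones((4, 4)) + sigma2 * np.eye(4)
print(np.allclose(Sigma, np.kron(np.eye(3), block)))
```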


17.3 EXAMPLES


We illustrate the broad applicability of the model in (17.2) with several simple 
examples. 


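Example 17.3a (Randomized Blocks). An experiment involving three treatments
was carried out by randomly assigning the treatments to experimental units within
each of four blocks of size 3. We could use the model

$$y_{ij} = \mu + \tau_i + a_j + \varepsilon_{ij},$$

where $i = 1, \ldots, 3$; $j = 1, \ldots, 4$; $a_j$ is $N(0, \sigma_1^2)$; $\varepsilon_{ij}$ is $N(0, \sigma^2)$; and $\mathrm{cov}(a_j, \varepsilon_{ij}) = 0$.
If we assume that the observations are sorted by blocks and treatments within
blocks, we can express this model in the form of (17.2) with

$$m = 1, \qquad \mathbf{X} = \begin{pmatrix} \mathbf{j}_3 & \mathbf{I}_3 \\ \mathbf{j}_3 & \mathbf{I}_3 \\ \mathbf{j}_3 & \mathbf{I}_3 \\ \mathbf{j}_3 & \mathbf{I}_3 \end{pmatrix}, \qquad \text{and} \qquad \mathbf{Z}_1 = \begin{pmatrix} \mathbf{j}_3 & \mathbf{0}_3 & \mathbf{0}_3 & \mathbf{0}_3 \\ \mathbf{0}_3 & \mathbf{j}_3 & \mathbf{0}_3 & \mathbf{0}_3 \\ \mathbf{0}_3 & \mathbf{0}_3 & \mathbf{j}_3 & \mathbf{0}_3 \\ \mathbf{0}_3 & \mathbf{0}_3 & \mathbf{0}_3 & \mathbf{j}_3 \end{pmatrix}.$$

Then

$$\boldsymbol{\Sigma} = \sigma_1^2\mathbf{Z}_1\mathbf{Z}_1' + \sigma^2\mathbf{I}_{12} = \begin{pmatrix} \boldsymbol{\Sigma}_1 & \mathbf{O} & \mathbf{O} & \mathbf{O} \\ \mathbf{O} & \boldsymbol{\Sigma}_1 & \mathbf{O} & \mathbf{O} \\ \mathbf{O} & \mathbf{O} & \boldsymbol{\Sigma}_1 & \mathbf{O} \\ \mathbf{O} & \mathbf{O} & \mathbf{O} & \boldsymbol{\Sigma}_1 \end{pmatrix}, \qquad \text{where} \qquad \boldsymbol{\Sigma}_1 = \begin{pmatrix} \sigma_1^2 + \sigma^2 & \sigma_1^2 & \sigma_1^2 \\ \sigma_1^2 & \sigma_1^2 + \sigma^2 & \sigma_1^2 \\ \sigma_1^2 & \sigma_1^2 & \sigma_1^2 + \sigma^2 \end{pmatrix}.$$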

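Example 17.3b (Subsampling). Five batches were produced using each of two
processes. Two samples were obtained and measured from each of the batches.
Constraining the process effects to sum to zero, the model is

$$y_{ijk} = \mu + \tau_i + a_{ij} + \varepsilon_{ijk},$$

where $i = 1, 2$; $j = 1, \ldots, 5$; $k = 1, 2$; $\tau_2 = -\tau_1$; $a_{ij}$ is $N(0, \sigma_1^2)$; $\varepsilon_{ijk}$ is $N(0, \sigma^2)$; and
$\mathrm{cov}(a_{ij}, \varepsilon_{ijk}) = 0$. If the observations are sorted by processes, batches within
processes, and samples within batches, we can put this model in the form of (17.2) with

$$m = 1, \qquad \mathbf{X} = \begin{pmatrix} \mathbf{j}_{10} & \mathbf{j}_{10} \\ \mathbf{j}_{10} & -\mathbf{j}_{10} \end{pmatrix}, \qquad \text{and} \qquad \mathbf{Z}_1 = \begin{pmatrix} \mathbf{j}_2 & \mathbf{0}_2 & \cdots & \mathbf{0}_2 \\ \mathbf{0}_2 & \mathbf{j}_2 & \cdots & \mathbf{0}_2 \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{0}_2 & \mathbf{0}_2 & \cdots & \mathbf{j}_2 \end{pmatrix}.$$

Hence

$$\boldsymbol{\Sigma} = \sigma_1^2\mathbf{Z}_1\mathbf{Z}_1' + \sigma^2\mathbf{I}_{20} = \begin{pmatrix} \boldsymbol{\Sigma}_1 & \mathbf{O} & \cdots & \mathbf{O} \\ \mathbf{O} & \boldsymbol{\Sigma}_1 & \cdots & \mathbf{O} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{O} & \mathbf{O} & \cdots & \boldsymbol{\Sigma}_1 \end{pmatrix}, \qquad \text{where} \qquad \boldsymbol{\Sigma}_1 = \begin{pmatrix} \sigma_1^2 + \sigma^2 & \sigma_1^2 \\ \sigma_1^2 & \sigma_1^2 + \sigma^2 \end{pmatrix}.$$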



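Example 17.3c (Split-Plot Studies). A 3 x 2 factorial experiment (with factors A
and B, respectively) was carried out using six main units, each of which was
subdivided into two subunits. The levels of A were each randomly assigned to two of the
main units, and the levels of B were randomly assigned to subunits within main units.
An appropriate model is

$$y_{ijk} = \mu + \tau_i + \delta_j + \theta_{ij} + a_{ik} + \varepsilon_{ijk},$$

where $i = 1, \ldots, 3$; $j = 1, 2$; $k = 1, 2$; $a_{ik}$ is $N(0, \sigma_1^2)$; $\varepsilon_{ijk}$ is $N(0, \sigma^2)$; and
$\mathrm{cov}(a_{ik}, \varepsilon_{ijk}) = 0$. If the observations are sorted by levels of A, main units within
levels of A, and levels of B within main units, we can express this model in the
form of (17.2) with

$$m = 1, \qquad \mathbf{X} = \begin{pmatrix}
1&1&0&0&1&0&1&0&0&0&0&0 \\
1&1&0&0&0&1&0&1&0&0&0&0 \\
1&1&0&0&1&0&1&0&0&0&0&0 \\
1&1&0&0&0&1&0&1&0&0&0&0 \\
1&0&1&0&1&0&0&0&1&0&0&0 \\
1&0&1&0&0&1&0&0&0&1&0&0 \\
1&0&1&0&1&0&0&0&1&0&0&0 \\
1&0&1&0&0&1&0&0&0&1&0&0 \\
1&0&0&1&1&0&0&0&0&0&1&0 \\
1&0&0&1&0&1&0&0&0&0&0&1 \\
1&0&0&1&1&0&0&0&0&0&1&0 \\
1&0&0&1&0&1&0&0&0&0&0&1
\end{pmatrix}, \qquad
\mathbf{Z}_1 = \begin{pmatrix}
1&0&0&0&0&0 \\
1&0&0&0&0&0 \\
0&1&0&0&0&0 \\
0&1&0&0&0&0 \\
0&0&1&0&0&0 \\
0&0&1&0&0&0 \\
0&0&0&1&0&0 \\
0&0&0&1&0&0 \\
0&0&0&0&1&0 \\
0&0&0&0&1&0 \\
0&0&0&0&0&1 \\
0&0&0&0&0&1
\end{pmatrix}.$$

Then

$$\boldsymbol{\Sigma} = \sigma_1^2\mathbf{Z}_1\mathbf{Z}_1' + \sigma^2\mathbf{I}_{12} = \begin{pmatrix} \boldsymbol{\Sigma}_1 & \mathbf{O} & \cdots & \mathbf{O} \\ \mathbf{O} & \boldsymbol{\Sigma}_1 & \cdots & \mathbf{O} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{O} & \mathbf{O} & \cdots & \boldsymbol{\Sigma}_1 \end{pmatrix}, \qquad \text{where} \qquad \boldsymbol{\Sigma}_1 = \begin{pmatrix} \sigma_1^2 + \sigma^2 & \sigma_1^2 \\ \sigma_1^2 & \sigma_1^2 + \sigma^2 \end{pmatrix}.$$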

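Example 17.3d (One-Way Random Effects). A chemical plant produced a large
number of batches. Each batch was packaged into a large number of containers.
We chose three batches at random, and randomly selected four containers from
each batch from which to measure y. The model is

$$y_{ij} = \mu + a_i + \varepsilon_{ij},$$

where $i = 1, \ldots, 3$; $j = 1, \ldots, 4$; $a_i$ is $N(0, \sigma_1^2)$; $\varepsilon_{ij}$ is $N(0, \sigma^2)$; and $\mathrm{cov}(a_i, \varepsilon_{ij}) = 0$.
If the observations are sorted by batches and containers within batches, we can
express this model in the form of (17.2) with

$$m = 1, \qquad \mathbf{X} = \mathbf{j}_{12}, \qquad \text{and} \qquad \mathbf{Z}_1 = \begin{pmatrix} \mathbf{j}_4 & \mathbf{0}_4 & \mathbf{0}_4 \\ \mathbf{0}_4 & \mathbf{j}_4 & \mathbf{0}_4 \\ \mathbf{0}_4 & \mathbf{0}_4 & \mathbf{j}_4 \end{pmatrix}.$$

Thus

$$\boldsymbol{\Sigma} = \sigma_1^2\mathbf{Z}_1\mathbf{Z}_1' + \sigma^2\mathbf{I}_{12} = \begin{pmatrix} \boldsymbol{\Sigma}_1 & \mathbf{O} & \mathbf{O} \\ \mathbf{O} & \boldsymbol{\Sigma}_1 & \mathbf{O} \\ \mathbf{O} & \mathbf{O} & \boldsymbol{\Sigma}_1 \end{pmatrix}, \qquad \text{where} \qquad \boldsymbol{\Sigma}_1 = \begin{pmatrix} \sigma_1^2 + \sigma^2 & \sigma_1^2 & \sigma_1^2 & \sigma_1^2 \\ \sigma_1^2 & \sigma_1^2 + \sigma^2 & \sigma_1^2 & \sigma_1^2 \\ \sigma_1^2 & \sigma_1^2 & \sigma_1^2 + \sigma^2 & \sigma_1^2 \\ \sigma_1^2 & \sigma_1^2 & \sigma_1^2 & \sigma_1^2 + \sigma^2 \end{pmatrix}.$$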

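Example 17.3e (Independent Random Coefficients). Three pups from each of
four litters of mice were used in an experiment. One pup from each litter was
exposed to one of three quantitative levels of a carcinogen. The relationship
between weight gain (y) and carcinogen level is a straight line, but slopes and
intercepts vary randomly and independently among litters. The three levels of the
carcinogen are denoted by $\mathbf{x}$. The model is

$$y_{ij} = \beta_0 + a_i + \beta_1x_j + b_ix_j + \varepsilon_{ij},$$

where $i = 1, \ldots, 4$; $j = 1, \ldots, 3$; $a_i$ is $N(0, \sigma_1^2)$; $b_i$ is $N(0, \sigma_2^2)$; $\varepsilon_{ij}$ is $N(0, \sigma^2)$; and
all the random effects are independent. If the data are sorted by litter and carcinogen
levels within litter, we can express this model in the form of (17.2) with

$$m = 2, \qquad \mathbf{X} = \begin{pmatrix} \mathbf{j}_3 & \mathbf{x} \\ \mathbf{j}_3 & \mathbf{x} \\ \mathbf{j}_3 & \mathbf{x} \\ \mathbf{j}_3 & \mathbf{x} \end{pmatrix}, \qquad \mathbf{Z}_1 = \begin{pmatrix} \mathbf{j}_3 & \mathbf{0}_3 & \mathbf{0}_3 & \mathbf{0}_3 \\ \mathbf{0}_3 & \mathbf{j}_3 & \mathbf{0}_3 & \mathbf{0}_3 \\ \mathbf{0}_3 & \mathbf{0}_3 & \mathbf{j}_3 & \mathbf{0}_3 \\ \mathbf{0}_3 & \mathbf{0}_3 & \mathbf{0}_3 & \mathbf{j}_3 \end{pmatrix}, \qquad \text{and} \qquad \mathbf{Z}_2 = \begin{pmatrix} \mathbf{x} & \mathbf{0}_3 & \mathbf{0}_3 & \mathbf{0}_3 \\ \mathbf{0}_3 & \mathbf{x} & \mathbf{0}_3 & \mathbf{0}_3 \\ \mathbf{0}_3 & \mathbf{0}_3 & \mathbf{x} & \mathbf{0}_3 \\ \mathbf{0}_3 & \mathbf{0}_3 & \mathbf{0}_3 & \mathbf{x} \end{pmatrix}.$$

Then

$$\boldsymbol{\Sigma} = \sigma_1^2\mathbf{Z}_1\mathbf{Z}_1' + \sigma_2^2\mathbf{Z}_2\mathbf{Z}_2' + \sigma^2\mathbf{I}_{12} = \begin{pmatrix} \boldsymbol{\Sigma}_1 & \mathbf{O} & \mathbf{O} & \mathbf{O} \\ \mathbf{O} & \boldsymbol{\Sigma}_1 & \mathbf{O} & \mathbf{O} \\ \mathbf{O} & \mathbf{O} & \boldsymbol{\Sigma}_1 & \mathbf{O} \\ \mathbf{O} & \mathbf{O} & \mathbf{O} & \boldsymbol{\Sigma}_1 \end{pmatrix},$$

where $\boldsymbol{\Sigma}_1 = \sigma_1^2\mathbf{J}_3 + \sigma_2^2\mathbf{x}\mathbf{x}' + \sigma^2\mathbf{I}_3$.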


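Example 17.3f (Heterogeneous Variances). Four individuals were randomly
sampled from each of four groups. The groups had different means and different
variances. We assume here that $\sigma^2 = 0$. The model is

$$y_{ij} = \mu_i + \varepsilon_{ij},$$

where $i = 1, \ldots, 4$; $j = 1, \ldots, 4$; $\varepsilon_{ij}$ is $N(0, \sigma_i^2)$. If the data are sorted by groups and
individuals within groups, we can express this model in the form of (17.2) with

$$m = 4, \quad \mathbf{X} = \begin{pmatrix} \mathbf{j}_4 & \mathbf{0}_4 & \mathbf{0}_4 & \mathbf{0}_4 \\ \mathbf{0}_4 & \mathbf{j}_4 & \mathbf{0}_4 & \mathbf{0}_4 \\ \mathbf{0}_4 & \mathbf{0}_4 & \mathbf{j}_4 & \mathbf{0}_4 \\ \mathbf{0}_4 & \mathbf{0}_4 & \mathbf{0}_4 & \mathbf{j}_4 \end{pmatrix}, \quad \mathbf{Z}_1 = \begin{pmatrix} \mathbf{I}_4 \\ \mathbf{O}_4 \\ \mathbf{O}_4 \\ \mathbf{O}_4 \end{pmatrix}, \quad \mathbf{Z}_2 = \begin{pmatrix} \mathbf{O}_4 \\ \mathbf{I}_4 \\ \mathbf{O}_4 \\ \mathbf{O}_4 \end{pmatrix}, \quad \mathbf{Z}_3 = \begin{pmatrix} \mathbf{O}_4 \\ \mathbf{O}_4 \\ \mathbf{I}_4 \\ \mathbf{O}_4 \end{pmatrix}, \quad \text{and} \quad \mathbf{Z}_4 = \begin{pmatrix} \mathbf{O}_4 \\ \mathbf{O}_4 \\ \mathbf{O}_4 \\ \mathbf{I}_4 \end{pmatrix}.$$

Hence

$$\boldsymbol{\Sigma} = \sigma_1^2\mathbf{Z}_1\mathbf{Z}_1' + \sigma_2^2\mathbf{Z}_2\mathbf{Z}_2' + \sigma_3^2\mathbf{Z}_3\mathbf{Z}_3' + \sigma_4^2\mathbf{Z}_4\mathbf{Z}_4' = \begin{pmatrix} \sigma_1^2\mathbf{I}_4 & \mathbf{O}_4 & \mathbf{O}_4 & \mathbf{O}_4 \\ \mathbf{O}_4 & \sigma_2^2\mathbf{I}_4 & \mathbf{O}_4 & \mathbf{O}_4 \\ \mathbf{O}_4 & \mathbf{O}_4 & \sigma_3^2\mathbf{I}_4 & \mathbf{O}_4 \\ \mathbf{O}_4 & \mathbf{O}_4 & \mathbf{O}_4 & \sigma_4^2\mathbf{I}_4 \end{pmatrix}.$$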


These models can be generalized and combined to yield a rich set of models
applicable to a broad spectrum of situations (see Problem 17.3). All the examples involved
balanced data for convenience of description, but model (17.2) applies equally well to
unbalanced situations. Allowing the covariance matrices of the $\mathbf{a}_i$'s and $\boldsymbol{\varepsilon}$ to be
nondiagonal (providing for such things as serial correlation) increases the scope of
application of these models even more, with only moderate increases in complexity (see
Problem 17.4).
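The block structure in these examples comes from Kronecker products of an identity matrix with a within-cluster column. As a sketch, the following Python code builds the matrices of Example 17.3e and confirms that each diagonal block of $\boldsymbol{\Sigma}$ is $\sigma_1^2\mathbf{J}_3 + \sigma_2^2\mathbf{x}\mathbf{x}' + \sigma^2\mathbf{I}_3$. The carcinogen levels in `x` and the variance components are hypothetical values chosen only for illustration.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0])      # hypothetical carcinogen levels
litters = 4
n = litters * len(x)

# Fixed part: common intercept and slope
X = np.column_stack([np.ones(n), np.tile(x, litters)])

# Random intercepts and random slopes, one per litter
Z1 = np.kron(np.eye(litters), np.ones((3, 1)))     # block diagonal of j_3
Z2 = np.kron(np.eye(litters), x.reshape(-1, 1))    # block diagonal of x

s1, s2, s = 1.5, 0.3, 0.8          # hypothetical sigma_1^2, sigma_2^2, sigma^2
Sigma = s1 * Z1 @ Z1.T + s2 * Z2 @ Z2.T + s * np.eye(n)

Sigma_1 = s1 * np.ones((3, 3)) + s2 * np.outer(x, x) + s * np.eye(3)
print(np.allclose(Sigma, np.kron(np.eye(litters), Sigma_1)))
```

The same pattern with unequal cluster sizes (for example, using `scipy.linalg.block_diag` or explicit indicator columns) covers unbalanced data.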


17.4 ESTIMATION OF VARIANCE COMPONENTS 


After specifying the appropriate model, the next task in using the linear mixed model
(17.2) in the analysis of data is to estimate the variance components. Once the
variance components have been estimated, $\boldsymbol{\Sigma}$ can be estimated and the estimate used
in the approximate generalized least-squares estimation of $\boldsymbol{\beta}$ and other inferences
as suggested by the results of Section 7.8.

Several methods for estimation of the variance components have been proposed
(Searle et al. 1992, pp. 168-257). We discuss one of these approaches, that of
restricted (or residual) maximum likelihood (REML) (Patterson and Thompson
1971). One reason for our emphasis on REML is that in standard linear models,
the usual estimate $s^2$ in (7.22) is the REML estimate. Also, REML is general; for
example, it can be applied regardless of balance. In certain balanced situations the
REML estimator has closed form. It is often the best (minimum variance) quadratic
unbiased estimator (see Theorem 7.3g).

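To develop the REML estimator, we add the normality assumption. Thus the
model is

$$\mathbf{y} \ \text{is} \ N_n(\mathbf{X}\boldsymbol{\beta}, \boldsymbol{\Sigma}), \qquad \text{where} \qquad \boldsymbol{\Sigma} = \sum_{i=1}^{m}\sigma_i^2\mathbf{Z}_i\mathbf{Z}_i' + \sigma^2\mathbf{I}_n, \tag{17.3}$$

where $\mathbf{X}$ is $n \times p$ of rank $r \leq p$, and $\boldsymbol{\Sigma}$ is a positive definite $n \times n$ matrix. To simplify
the notation, we let $\sigma_0^2 = \sigma^2$ and $\mathbf{Z}_0 = \mathbf{I}_n$ so that (17.3) becomes

$$\mathbf{y} \ \text{is} \ N_n(\mathbf{X}\boldsymbol{\beta}, \boldsymbol{\Sigma}), \qquad \text{where} \qquad \boldsymbol{\Sigma} = \sum_{i=0}^{m}\sigma_i^2\mathbf{Z}_i\mathbf{Z}_i'. \tag{17.4}$$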




The idea of REML is to carry out maximum likelihood estimation for data $\mathbf{K}\mathbf{y}$
rather than $\mathbf{y}$, where $\mathbf{K}$ is chosen so that the distribution of $\mathbf{K}\mathbf{y}$ involves only the
variance components, not $\boldsymbol{\beta}$. In order for this to occur, we seek a matrix $\mathbf{K}$ such that
$\mathbf{K}\mathbf{X} = \mathbf{O}$. Hence $E(\mathbf{K}\mathbf{y}) = \mathbf{K}\mathbf{X}\boldsymbol{\beta} = \mathbf{0}$. For simplicity we require that $\mathbf{K}$ be of full
rank. We also want $\mathbf{K}\mathbf{y}$ to contain as much information as possible about the variance
components, so $\mathbf{K}$ must have the maximal number of rows for such a matrix.


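Theorem 17.4a. Let $\mathbf{X}$ be as in (17.3). A full-rank matrix $\mathbf{K}$ with maximal number of
rows such that $\mathbf{K}\mathbf{X} = \mathbf{O}$ is an $(n - r) \times n$ matrix. Furthermore, $\mathbf{K}$ must be of the form
$\mathbf{K} = \mathbf{C}(\mathbf{I} - \mathbf{H}) = \mathbf{C}[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}']$, where $\mathbf{C}$ specifies a full-rank transformation of
the rows of $\mathbf{I} - \mathbf{H}$.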


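PROOF. The rows $\mathbf{k}_i'$ of $\mathbf{K}$ must satisfy the equations $\mathbf{k}_i'\mathbf{X} = \mathbf{0}'$ or, equivalently,
$\mathbf{X}'\mathbf{k}_i = \mathbf{0}$. Using Theorem 2.8e, solutions to this system of equations are given by
$\mathbf{k}_i = [\mathbf{I} - (\mathbf{X}')^{-}\mathbf{X}']\mathbf{c}$ for all possible $n \times 1$ vectors $\mathbf{c}$. In other words, the solutions
include all possible linear combinations of the columns of $\mathbf{I} - (\mathbf{X}')^{-}\mathbf{X}'$.

By Theorem 2.8c(i), $\mathrm{rank}[(\mathbf{X}')^{-}\mathbf{X}'] = \mathrm{rank}(\mathbf{X}) = r$. Also, by Theorem 2.13e,
$\mathbf{I} - (\mathbf{X}')^{-}\mathbf{X}'$ is idempotent. Because of this idempotency, $\mathrm{rank}[\mathbf{I} - (\mathbf{X}')^{-}\mathbf{X}'] =
\mathrm{tr}[\mathbf{I} - (\mathbf{X}')^{-}\mathbf{X}'] = \mathrm{tr}(\mathbf{I}) - \mathrm{tr}[(\mathbf{X}')^{-}\mathbf{X}'] = n - r$. Hence by the definition of rank (see
Section 2.4), there are $n - r$ linearly independent vectors $\mathbf{k}_i$ that satisfy $\mathbf{X}'\mathbf{k}_i = \mathbf{0}$,
and thus the maximal number of rows in $\mathbf{K}$ is $n - r$.

Since $\mathbf{k}_i = [\mathbf{I} - (\mathbf{X}')^{-}\mathbf{X}']\mathbf{c}$, $\mathbf{K} = \mathbf{C}[\mathbf{I} - (\mathbf{X}')^{-}\mathbf{X}']$ for some full-rank $(n - r) \times n$ matrix
$\mathbf{C}$ that specifies $n - r$ linearly independent linear combinations of the rows of the
symmetric matrix $\mathbf{I} - (\mathbf{X}')^{-}\mathbf{X}'$. By Theorem 2.8c(iv)-(v), $\mathbf{K}$ can also be written as
$\mathbf{C}(\mathbf{I} - \mathbf{H}) = \mathbf{C}[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}']$.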


There are an infinite number of such $\mathbf{K}$s, and it does not matter which is used.
Also, note that $(\mathbf{I} - \mathbf{H})\mathbf{y}$ gives the ordinary residual vector $\hat{\boldsymbol{\varepsilon}}$ in (9.5), so that
$\mathbf{K}\mathbf{y} = \mathbf{C}(\mathbf{I} - \mathbf{H})\mathbf{y}$ is a vector of linear combinations of these residuals. Thus the
designation residual maximum likelihood is appropriate.
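One concrete way to obtain a valid $\mathbf{K}$ is sketched below in Python. Because $\mathbf{I} - \mathbf{H}$ is symmetric and idempotent, its eigenvalues are 0 or 1, and the eigenvectors with eigenvalue 1 supply a full-rank $\mathbf{C}$. This particular choice of $\mathbf{C}$, the use of the Moore-Penrose inverse as the generalized inverse, and the function name are assumptions made for illustration; any other full-rank $\mathbf{C}$ would serve equally well.

```python
import numpy as np


def reml_contrast_matrix(X):
    """Return a full-rank (n - r) x n matrix K = C(I - H) with K X = O."""
    n = X.shape[0]
    H = X @ np.linalg.pinv(X.T @ X) @ X.T       # X (X'X)^- X'
    M = np.eye(n) - H                           # symmetric and idempotent
    eigval, eigvec = np.linalg.eigh(M)
    C = eigvec[:, eigval > 0.5].T               # eigenvectors with eigenvalue 1
    return C @ M


# Non-full-rank X: overparameterized one-way model, two groups of three
X = np.column_stack([np.ones(6), np.kron(np.eye(2), np.ones((3, 1)))])
K = reml_contrast_matrix(X)

print(K.shape)                      # (n - r, n) with n = 6, r = 2
print(np.allclose(K @ X, 0.0))
```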

The distribution of Ky for any K defined as in Theorem 17.4a is given in the 


following theorem. 


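Theorem 17.4b. Consider the model in which $\mathbf{y}$ is $N_n(\mathbf{X}\boldsymbol{\beta}, \boldsymbol{\Sigma})$, where
$\boldsymbol{\Sigma} = \sum_{i=0}^{m}\sigma_i^2\mathbf{Z}_i\mathbf{Z}_i'$, and let $\mathbf{K}$ be specified as in Theorem 17.4a. Then

$$\mathbf{K}\mathbf{y} \ \text{is} \ N_{n-r}(\mathbf{0}, \mathbf{K}\boldsymbol{\Sigma}\mathbf{K}') \quad \text{or} \quad N_{n-r}\Big[\mathbf{0},\ \mathbf{K}\Big(\sum_{i=0}^{m}\sigma_i^2\mathbf{Z}_i\mathbf{Z}_i'\Big)\mathbf{K}'\Big]. \tag{17.5}$$

PROOF. Since $\mathbf{K}\mathbf{X} = \mathbf{O}$, the theorem follows directly from Theorem 4.4a(ii).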


Thus the distribution of the transformed data $\mathbf{K}\mathbf{y}$ involves only the $m + 1$ variance
components as unknown parameters. In order to estimate the variance components,
the next step in REML is to maximize the likelihood of $\mathbf{K}\mathbf{y}$ with respect to these
variance components. We now develop a set of estimating equations by taking partial
derivatives of the log likelihood with respect to the variance components, and setting
them to zero.


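Theorem 17.4c. Consider the model in which $\mathbf{y}$ is $N_n(\mathbf{X}\boldsymbol{\beta}, \boldsymbol{\Sigma})$, where
$\boldsymbol{\Sigma} = \sum_{i=0}^{m}\sigma_i^2\mathbf{Z}_i\mathbf{Z}_i'$, and let $\mathbf{K}$ be specified as in Theorem 17.4a. Then a set of
$m + 1$ estimating equations for $\sigma_0^2, \ldots, \sigma_m^2$ is given by

$$\mathrm{tr}[\mathbf{K}'(\mathbf{K}\boldsymbol{\Sigma}\mathbf{K}')^{-1}\mathbf{K}\mathbf{Z}_i\mathbf{Z}_i'] = \mathbf{y}'\mathbf{K}'(\mathbf{K}\boldsymbol{\Sigma}\mathbf{K}')^{-1}\mathbf{K}\mathbf{Z}_i\mathbf{Z}_i'\mathbf{K}'(\mathbf{K}\boldsymbol{\Sigma}\mathbf{K}')^{-1}\mathbf{K}\mathbf{y} \tag{17.6}$$

for $i = 0, \ldots, m$.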


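PROOF. Since $E(\mathbf{K}\mathbf{y}) = \mathbf{0}$, the log likelihood of $\mathbf{K}\mathbf{y}$ is

$$\begin{aligned}
\ln L(\sigma_0^2, \ldots, \sigma_m^2) &= -\frac{n-r}{2}\ln(2\pi) - \frac{1}{2}\ln|\mathbf{K}\boldsymbol{\Sigma}\mathbf{K}'| - \frac{1}{2}\mathbf{y}'\mathbf{K}'(\mathbf{K}\boldsymbol{\Sigma}\mathbf{K}')^{-1}\mathbf{K}\mathbf{y} \\
&= -\frac{n-r}{2}\ln(2\pi) - \frac{1}{2}\ln\Big|\mathbf{K}\Big(\sum_{i=0}^{m}\sigma_i^2\mathbf{Z}_i\mathbf{Z}_i'\Big)\mathbf{K}'\Big| - \frac{1}{2}\mathbf{y}'\mathbf{K}'\Big[\mathbf{K}\Big(\sum_{i=0}^{m}\sigma_i^2\mathbf{Z}_i\mathbf{Z}_i'\Big)\mathbf{K}'\Big]^{-1}\mathbf{K}\mathbf{y}.
\end{aligned}$$

Using (2.117) and (2.118) to take the partial derivative of $\ln L(\sigma_0^2, \ldots, \sigma_m^2)$ with
respect to each of the $\sigma_i^2$'s, we obtain

$$\begin{aligned}
\frac{\partial}{\partial\sigma_i^2}\ln L(\sigma_0^2, \ldots, \sigma_m^2) &= -\frac{1}{2}\mathrm{tr}\Big[(\mathbf{K}\boldsymbol{\Sigma}\mathbf{K}')^{-1}\frac{\partial}{\partial\sigma_i^2}(\mathbf{K}\boldsymbol{\Sigma}\mathbf{K}')\Big] + \frac{1}{2}\mathbf{y}'\mathbf{K}'(\mathbf{K}\boldsymbol{\Sigma}\mathbf{K}')^{-1}\Big[\frac{\partial}{\partial\sigma_i^2}(\mathbf{K}\boldsymbol{\Sigma}\mathbf{K}')\Big](\mathbf{K}\boldsymbol{\Sigma}\mathbf{K}')^{-1}\mathbf{K}\mathbf{y} \\
&= -\frac{1}{2}\mathrm{tr}[(\mathbf{K}\boldsymbol{\Sigma}\mathbf{K}')^{-1}\mathbf{K}\mathbf{Z}_i\mathbf{Z}_i'\mathbf{K}'] + \frac{1}{2}\mathbf{y}'\mathbf{K}'(\mathbf{K}\boldsymbol{\Sigma}\mathbf{K}')^{-1}\mathbf{K}\mathbf{Z}_i\mathbf{Z}_i'\mathbf{K}'(\mathbf{K}\boldsymbol{\Sigma}\mathbf{K}')^{-1}\mathbf{K}\mathbf{y} \\
&= -\frac{1}{2}\mathrm{tr}[\mathbf{K}'(\mathbf{K}\boldsymbol{\Sigma}\mathbf{K}')^{-1}\mathbf{K}\mathbf{Z}_i\mathbf{Z}_i'] + \frac{1}{2}\mathbf{y}'\mathbf{K}'(\mathbf{K}\boldsymbol{\Sigma}\mathbf{K}')^{-1}\mathbf{K}\mathbf{Z}_i\mathbf{Z}_i'\mathbf{K}'(\mathbf{K}\boldsymbol{\Sigma}\mathbf{K}')^{-1}\mathbf{K}\mathbf{y}.
\end{aligned}$$

Setting these equations to zero, the result follows.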




It is interesting to note that using Theorem 5.2a, the expected value of the
quadratic form on the right side of (17.6) is given by the left side of (17.6).

Applying Theorem 17.4c, we obtain $m + 1$ equations in $m + 1$ unknown $\sigma_i^2$'s. In
some cases these equations can be simplified to yield closed-form estimating
equations. In most cases, numerical methods have to be used to solve the equations
(McCulloch and Searle 2001, pp. 263-269).

If the solutions to the equations are nonnegative, the solutions are REML estimates 
of the variance components. If any of the solutions are negative, the log likelihood 
must be examined to find values of the variance components within the parameter 
space (i.e., nonnegative values) that maximize the function. 


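Example 17.4 (One-Way Random Effects). This is an extension of Example
17.3(d). Four containers are randomly selected from each of three batches produced
by a chemical plant. Hence

$$\mathbf{X} = \mathbf{j}_{12}, \qquad \mathbf{Z}_0 = \mathbf{I}_{12}, \qquad \mathbf{Z}_1 = \begin{pmatrix} \mathbf{j}_4 & \mathbf{0}_4 & \mathbf{0}_4 \\ \mathbf{0}_4 & \mathbf{j}_4 & \mathbf{0}_4 \\ \mathbf{0}_4 & \mathbf{0}_4 & \mathbf{j}_4 \end{pmatrix}, \qquad \text{and} \qquad \boldsymbol{\Sigma} = \sigma_0^2\mathbf{I}_{12} + \sigma_1^2\mathbf{Z}_1\mathbf{Z}_1'.$$

Then $\mathbf{I} - \mathbf{H} = \mathbf{I}_{12} - \frac{1}{12}\mathbf{J}_{12}$, a suitable $\mathbf{C}$ would be $\mathbf{C} = (\mathbf{I}_{11}, \mathbf{0}_{11})$, and $\mathbf{K} = \mathbf{C}(\mathbf{I} - \mathbf{H})$.
Inserting these matrices into (17.6), it can be shown that we obtain the two estimating
equations

$$9\sigma_0^2 = \mathbf{y}'(\mathbf{I}_{12} - \tfrac{1}{4}\mathbf{Z}_1\mathbf{Z}_1')\mathbf{y},$$

$$2(4\sigma_1^2 + \sigma_0^2) = \mathbf{y}'(\tfrac{1}{4}\mathbf{Z}_1\mathbf{Z}_1' - \tfrac{1}{12}\mathbf{J}_{12})\mathbf{y}.$$

From these we obtain the closed-form solutions

$$\hat{\sigma}_0^2 = \frac{\mathbf{y}'(\mathbf{I}_{12} - \frac{1}{4}\mathbf{Z}_1\mathbf{Z}_1')\mathbf{y}}{9}, \qquad \hat{\sigma}_1^2 = \frac{\mathbf{y}'(\frac{1}{4}\mathbf{Z}_1\mathbf{Z}_1' - \frac{1}{12}\mathbf{J}_{12})\mathbf{y}/2 - \hat{\sigma}_0^2}{4}.$$

If both $\hat{\sigma}_0^2$ and $\hat{\sigma}_1^2$ are positive, they are the REML estimates of $\sigma_0^2$ and $\sigma_1^2$. Because
$\mathbf{I}_{12} - \frac{1}{4}\mathbf{Z}_1\mathbf{Z}_1'$ is positive semidefinite (it is idempotent of rank 9), $\hat{\sigma}_0^2$ is never negative,
and it is positive with probability 1. However, $\hat{\sigma}_1^2$ could be
negative. In such a case, the REML estimates become

$$\hat{\sigma}_0^2 = \frac{\mathbf{y}'(\mathbf{I}_{12} - \frac{1}{12}\mathbf{J}_{12})\mathbf{y}}{11}, \qquad \hat{\sigma}_1^2 = 0.$$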



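In practice, the equations in (17.6) are seldom used directly to obtain solutions.
The usual procedure involves any of a number of iterative methods (Rao 1997,
pp. 104-105; McCulloch and Searle 2001, pp. 265-269). To motivate one of
these methods, note that the system of $m + 1$ equations generated by (17.6) can be
written as

$$\mathbf{M}\boldsymbol{\sigma} = \mathbf{q}, \tag{17.7}$$

where $\boldsymbol{\sigma} = (\sigma_0^2, \sigma_1^2, \ldots, \sigma_m^2)'$, $\mathbf{M}$ is a nonsingular $(m+1) \times (m+1)$ matrix with $(ij)$th
element $\mathrm{tr}[\mathbf{K}'(\mathbf{K}\boldsymbol{\Sigma}\mathbf{K}')^{-1}\mathbf{K}\mathbf{Z}_i\mathbf{Z}_i'\mathbf{K}'(\mathbf{K}\boldsymbol{\Sigma}\mathbf{K}')^{-1}\mathbf{K}\mathbf{Z}_j\mathbf{Z}_j']$, and $\mathbf{q}$ is an $(m+1) \times 1$ vector
with $i$th element $\mathbf{y}'\mathbf{K}'(\mathbf{K}\boldsymbol{\Sigma}\mathbf{K}')^{-1}\mathbf{K}\mathbf{Z}_i\mathbf{Z}_i'\mathbf{K}'(\mathbf{K}\boldsymbol{\Sigma}\mathbf{K}')^{-1}\mathbf{K}\mathbf{y}$ (Problem 17.6). Equation
(17.7) is more complicated than it looks because both $\mathbf{M}$ and $\mathbf{q}$ are themselves
functions of $\boldsymbol{\sigma}$. Nonetheless, the equation is useful for stepwise improvement of an initial
guess $\boldsymbol{\sigma}_{(1)}$. The method proceeds by computing $\mathbf{M}_{(t)}$ and $\mathbf{q}_{(t)}$ using $\boldsymbol{\sigma}_{(t)}$ at step $t$. Then
let $\boldsymbol{\sigma}_{(t+1)} = \mathbf{M}_{(t)}^{-1}\mathbf{q}_{(t)}$. The procedure continues until $\boldsymbol{\sigma}_{(t)}$ converges.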
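The stepwise scheme based on (17.7) can be sketched in a few lines of Python. The code below is a minimal illustration, not production REML software: the function name, the stopping rule, and the demonstration data are assumptions, and no attempt is made to keep the iterates inside the parameter space, so a negative component would have to be handled separately as described after Theorem 17.4c.

```python
import numpy as np


def reml_iterate(y, X, Z_list, sigma_init, max_iter=50, tol=1e-10):
    """Iterate sigma_(t+1) = M_(t)^{-1} q_(t); returns (sigma_0^2, ..., sigma_m^2)."""
    n = len(y)
    V = [np.eye(n)] + [Z @ Z.T for Z in Z_list]       # Z_i Z_i', with Z_0 = I_n
    # Build K = C(I - H) with KX = O from the eigenvectors of I - H
    H = X @ np.linalg.pinv(X.T @ X) @ X.T
    eigval, eigvec = np.linalg.eigh(np.eye(n) - H)
    K = eigvec[:, eigval > 0.5].T
    sigma = np.asarray(sigma_init, dtype=float)
    for _ in range(max_iter):
        Sigma = sum(s * Vi for s, Vi in zip(sigma, V))
        P = K.T @ np.linalg.solve(K @ Sigma @ K.T, K)  # K'(K Sigma K')^{-1} K
        M = np.array([[np.trace(P @ Vi @ P @ Vj) for Vj in V] for Vi in V])
        q = np.array([y @ P @ Vi @ P @ y for Vi in V])
        new_sigma = np.linalg.solve(M, q)
        converged = np.max(np.abs(new_sigma - sigma)) < tol
        sigma = new_sigma
        if converged:
            break
    return sigma


# Hypothetical one-way random effects data: 3 batches of 4 containers
y_demo = np.array([10.1, 9.8, 10.4, 10.0,
                   12.2, 11.9, 12.5, 12.1,
                   8.9, 9.3, 9.0, 9.2])
X = np.ones((12, 1))
Z1 = np.kron(np.eye(3), np.ones((4, 1)))
sigma_hat = reml_iterate(y_demo, X, [Z1], sigma_init=[1.0, 1.0])
print(sigma_hat)
```

For this balanced layout the iteration reproduces the closed-form solutions of Example 17.4 after a single update, whatever the starting values.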


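17.5 INFERENCE FOR β


17.5.1 An Estimator for β


Estimates of the variance components can be inserted into $\boldsymbol{\Sigma}$ to obtain
$\hat{\boldsymbol{\Sigma}} = \sum_{i=0}^{m}\hat{\sigma}_i^2\mathbf{Z}_i\mathbf{Z}_i'$. A sensible estimator for $\boldsymbol{\beta}$ is then obtained by replacing $\sigma^2\mathbf{V}$
in equation (7.64) by its estimate, $\hat{\boldsymbol{\Sigma}}$. Generalizing the model to accommodate
non-full-rank $\mathbf{X}$ matrices, we obtain

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\hat{\boldsymbol{\Sigma}}^{-1}\mathbf{X})^{-}\mathbf{X}'\hat{\boldsymbol{\Sigma}}^{-1}\mathbf{y}. \tag{17.8}$$

This estimator, sometimes called the estimated generalized least-squares (EGLS)
estimator, is a nonlinear function of $\mathbf{y}$ (since $\hat{\boldsymbol{\Sigma}}$ is a nonlinear function of $\mathbf{y}$). Even
if $\mathbf{X}$ is full-rank, $\hat{\boldsymbol{\beta}}$ is not in general a (minimum variance) unbiased estimator
(MVUE) or normally distributed. However, it is always asymptotically MVUE and
normally distributed (Fuller and Battese 1973).

Similarly, a sensible approximate covariance matrix for $\hat{\boldsymbol{\beta}}$ is, by extension of
(12.18), as follows:

$$\mathrm{cov}(\hat{\boldsymbol{\beta}}) = (\mathbf{X}'\hat{\boldsymbol{\Sigma}}^{-1}\mathbf{X})^{-}\mathbf{X}'\hat{\boldsymbol{\Sigma}}^{-1}\mathbf{X}(\mathbf{X}'\hat{\boldsymbol{\Sigma}}^{-1}\mathbf{X})^{-}. \tag{17.9}$$

Of course, if $\mathbf{X}$ is full-rank, the expression in (17.9) simplifies to

$$\mathrm{cov}(\hat{\boldsymbol{\beta}}) = (\mathbf{X}'\hat{\boldsymbol{\Sigma}}^{-1}\mathbf{X})^{-1}.$$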
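Equations (17.8) and (17.9) can be computed as follows. In this Python sketch the Moore-Penrose inverse stands in for the generalized inverse (one of many valid choices), and the function name, the data, and the plugged-in variance components are hypothetical. For the balanced one-way random effects layout with $\mathbf{X} = \mathbf{j}_{12}$, the EGLS estimate reduces to the sample mean.

```python
import numpy as np


def egls(y, X, Sigma_hat):
    """Estimated GLS estimate (17.8) and its approximate covariance (17.9)."""
    Si_X = np.linalg.solve(Sigma_hat, X)     # Sigma_hat^{-1} X
    A = X.T @ Si_X                           # X' Sigma_hat^{-1} X
    A_ginv = np.linalg.pinv(A)               # one choice of generalized inverse
    beta_hat = A_ginv @ (Si_X.T @ y)         # uses symmetry of Sigma_hat
    cov_hat = A_ginv @ A @ A_ginv
    return beta_hat, cov_hat


# Hypothetical example: 3 batches of 4, estimated components 0.05 and 2.4
Z1 = np.kron(np.eye(3), np.ones((4, 1)))
Sigma_hat = 0.05 * np.eye(12) + 2.4 * (Z1 @ Z1.T)
X = np.ones((12, 1))
y = np.array([10.1, 9.8, 10.4, 10.0,
              12.2, 11.9, 12.5, 12.1,
              8.9, 9.3, 9.0, 9.2])

beta_hat, cov_hat = egls(y, X, Sigma_hat)
print(beta_hat, cov_hat)
```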




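17.5.2 Large-Sample Inference for Estimable Functions of β


Carrying the procedure of replacing $\sigma^2\mathbf{V}$ by its estimate $\hat{\boldsymbol{\Sigma}}$ a bit further, it seems
reasonable to extend Theorem 12.7c(ii) and conclude that for a known full-rank
$g \times p$ matrix $\mathbf{L}$ whose rows define estimable functions of $\boldsymbol{\beta}$

$$\mathbf{L}\hat{\boldsymbol{\beta}} \ \text{is approximately} \ N_g[\mathbf{L}\boldsymbol{\beta},\ \mathbf{L}(\mathbf{X}'\hat{\boldsymbol{\Sigma}}^{-1}\mathbf{X})^{-}\mathbf{L}'] \tag{17.10}$$

and therefore by (5.35)

$$(\mathbf{L}\hat{\boldsymbol{\beta}} - \mathbf{L}\boldsymbol{\beta})'[\mathbf{L}(\mathbf{X}'\hat{\boldsymbol{\Sigma}}^{-1}\mathbf{X})^{-}\mathbf{L}']^{-1}(\mathbf{L}\hat{\boldsymbol{\beta}} - \mathbf{L}\boldsymbol{\beta}) \ \text{is approximately} \ \chi^2(g). \tag{17.11}$$

If so, an approximate general linear hypothesis test for the testable hypothesis
$H_0: \mathbf{L}\boldsymbol{\beta} = \mathbf{t}$ is carried out using the test statistic

$$G = (\mathbf{L}\hat{\boldsymbol{\beta}} - \mathbf{t})'[\mathbf{L}(\mathbf{X}'\hat{\boldsymbol{\Sigma}}^{-1}\mathbf{X})^{-}\mathbf{L}']^{-1}(\mathbf{L}\hat{\boldsymbol{\beta}} - \mathbf{t}). \tag{17.12}$$

If $H_0$ is true, $G$ is approximately distributed as $\chi^2(g)$. If $H_0$ is false, $G$ is
approximately distributed as $\chi^2(g, \lambda)$, where $\lambda = (\mathbf{L}\boldsymbol{\beta} - \mathbf{t})'[\mathbf{L}(\mathbf{X}'\boldsymbol{\Sigma}^{-1}\mathbf{X})^{-}\mathbf{L}']^{-1}(\mathbf{L}\boldsymbol{\beta} - \mathbf{t})$.
The test is carried out by rejecting $H_0$ if $G \geq \chi^2_{\alpha, g}$.

Similarly, an approximate $100(1 - \alpha)\%$ confidence interval for a single estimable
function $\mathbf{c}'\boldsymbol{\beta}$ is given by

$$\mathbf{c}'\hat{\boldsymbol{\beta}} \pm z_{\alpha/2}\sqrt{\mathbf{c}'(\mathbf{X}'\hat{\boldsymbol{\Sigma}}^{-1}\mathbf{X})^{-}\mathbf{c}}. \tag{17.13}$$

Approximate joint confidence regions for $\boldsymbol{\beta}$, approximate confidence intervals for
individual $\beta_j$'s, and approximate confidence intervals for $E(\mathbf{y})$ can be similarly
proposed using (17.10) and (17.11).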
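A sketch of the large-sample computations in (17.12) and (17.13) follows. In this Python code the argument `cov_hat` stands for $(\mathbf{X}'\hat{\boldsymbol{\Sigma}}^{-1}\mathbf{X})^{-}$, and the function names and numerical inputs are hypothetical. Only the standard library is used for the normal quantile; the chi-square critical value would come from a table or a statistics library.

```python
import math
from statistics import NormalDist

import numpy as np


def wald_statistic(L, beta_hat, cov_hat, t):
    """G = (L b - t)' [L cov_hat L']^{-1} (L b - t), as in (17.12)."""
    diff = L @ beta_hat - t
    return float(diff @ np.linalg.solve(L @ cov_hat @ L.T, diff))


def approx_ci(c, beta_hat, cov_hat, alpha=0.05):
    """Approximate 100(1 - alpha)% interval for c'beta, as in (17.13)."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    est = float(c @ beta_hat)
    half_width = z * math.sqrt(float(c @ cov_hat @ c))
    return est - half_width, est + half_width


# Hypothetical estimate and covariance for three cell means
beta_hat = np.array([1.0, 2.0, 3.0])
cov_hat = 0.04 * np.eye(3)
L = np.array([[1.0, -1.0, 0.0],
              [0.0, 1.0, -1.0]])

G = wald_statistic(L, beta_hat, cov_hat, t=np.zeros(2))   # compare with chi-square(2)
lo, hi = approx_ci(np.array([1.0, 0.0, 0.0]), beta_hat, cov_hat)
print(G, (lo, hi))
```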


17.5.3 Small-Sample Inference for Estimable Functions of β


The inferences of Section 17.5.2 are not satisfactory for small samples. Exact
small-sample inferences based on the $t$ distribution and $F$ distribution are available in rare
cases, but are not generally available for mixed models. However, much work has
been done on approximate inference for small-sample mixed models.

First we discuss the exact small-sample inferences that are available in rare cases,
usually involving balanced designs, nonnegative solutions to the REML equations,
and certain estimable functions. In order for this to occur, $[\mathbf{L}(\mathbf{X}'\hat{\boldsymbol{\Sigma}}^{-1}\mathbf{X})^{-}\mathbf{L}']^{-1}$ must
be of the form $(d/w)\mathbf{Q}$, where $w$ is a central chi-square random variable with $d$
degrees of freedom, and independently $(\mathbf{L}\hat{\boldsymbol{\beta}} - \mathbf{t})'\mathbf{Q}(\mathbf{L}\hat{\boldsymbol{\beta}} - \mathbf{t})$ must be distributed as
a (possibly noncentral) chi-square random variable with $g$ degrees of freedom.




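Under these conditions, by (5.30), the statistic

$$\frac{(\mathbf{L}\hat{\boldsymbol{\beta}} - \mathbf{t})'\mathbf{Q}(\mathbf{L}\hat{\boldsymbol{\beta}} - \mathbf{t})/g}{w/d} = \frac{(\mathbf{L}\hat{\boldsymbol{\beta}} - \mathbf{t})'[\mathbf{L}(\mathbf{X}'\hat{\boldsymbol{\Sigma}}^{-1}\mathbf{X})^{-}\mathbf{L}']^{-1}(\mathbf{L}\hat{\boldsymbol{\beta}} - \mathbf{t})}{g}$$

is F-distributed. We demonstrate this with an example.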


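Example 17.5 (Balanced Split-Plot Study). As in Example 17.3c, consider a
3 x 2 balanced factorial experiment carried out using six main units, each of which is
subdivided into two subunits. The levels of A are each randomly assigned to two of
the main units, and the levels of B are randomly assigned to subunits within main
units. We assume that the data are sorted by replicates (with two complete replicates
in the study), levels of A, and then levels of B. We use the cell means parameterization
as in Section 14.3.1. The means in $\boldsymbol{\beta}$ are sorted by levels of A and then levels of B.
Hence

$$\mathbf{X} = \begin{pmatrix} \mathbf{I}_6 \\ \mathbf{I}_6 \end{pmatrix} \qquad \text{and} \qquad \boldsymbol{\Sigma} = \begin{pmatrix} \boldsymbol{\Sigma}_1 & \mathbf{O} & \mathbf{O} & \mathbf{O} & \mathbf{O} & \mathbf{O} \\ \mathbf{O} & \boldsymbol{\Sigma}_1 & \mathbf{O} & \mathbf{O} & \mathbf{O} & \mathbf{O} \\ \mathbf{O} & \mathbf{O} & \boldsymbol{\Sigma}_1 & \mathbf{O} & \mathbf{O} & \mathbf{O} \\ \mathbf{O} & \mathbf{O} & \mathbf{O} & \boldsymbol{\Sigma}_1 & \mathbf{O} & \mathbf{O} \\ \mathbf{O} & \mathbf{O} & \mathbf{O} & \mathbf{O} & \boldsymbol{\Sigma}_1 & \mathbf{O} \\ \mathbf{O} & \mathbf{O} & \mathbf{O} & \mathbf{O} & \mathbf{O} & \boldsymbol{\Sigma}_1 \end{pmatrix}, \qquad \text{where} \qquad \boldsymbol{\Sigma}_1 = \begin{pmatrix} \sigma_1^2 + \sigma^2 & \sigma_1^2 \\ \sigma_1^2 & \sigma_1^2 + \sigma^2 \end{pmatrix}.$$

We test the no-interaction hypothesis $H_0: \mathbf{L}\boldsymbol{\beta} = \mathbf{0}$, where

$$\mathbf{L} = \begin{pmatrix} 1 & -1 & -1 & 1 & 0 & 0 \\ 1 & -1 & 0 & 0 & -1 & 1 \end{pmatrix}.$$

Assuming that the REML estimating equations yield nonnegative solutions, $\hat{\sigma}^2$ is
given by

$$\hat{\sigma}^2 = \frac{1}{12}\,\mathbf{y}'\begin{pmatrix} \mathbf{R} & \mathbf{O} & \mathbf{O} & -\mathbf{R} & \mathbf{O} & \mathbf{O} \\ \mathbf{O} & \mathbf{R} & \mathbf{O} & \mathbf{O} & -\mathbf{R} & \mathbf{O} \\ \mathbf{O} & \mathbf{O} & \mathbf{R} & \mathbf{O} & \mathbf{O} & -\mathbf{R} \\ -\mathbf{R} & \mathbf{O} & \mathbf{O} & \mathbf{R} & \mathbf{O} & \mathbf{O} \\ \mathbf{O} & -\mathbf{R} & \mathbf{O} & \mathbf{O} & \mathbf{R} & \mathbf{O} \\ \mathbf{O} & \mathbf{O} & -\mathbf{R} & \mathbf{O} & \mathbf{O} & \mathbf{R} \end{pmatrix}\mathbf{y}, \qquad \text{where} \qquad \mathbf{R} = \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}.$$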




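Multiplying and simplifying, we obtain

$$\mathbf{X}'\hat{\boldsymbol{\Sigma}}^{-1}\mathbf{X} = 2\begin{pmatrix} \hat{\boldsymbol{\Sigma}}_1^{-1} & \mathbf{O} & \mathbf{O} \\ \mathbf{O} & \hat{\boldsymbol{\Sigma}}_1^{-1} & \mathbf{O} \\ \mathbf{O} & \mathbf{O} & \hat{\boldsymbol{\Sigma}}_1^{-1} \end{pmatrix}.$$

By (2.52), we have

$$(\mathbf{X}'\hat{\boldsymbol{\Sigma}}^{-1}\mathbf{X})^{-1} = \frac{1}{2}\begin{pmatrix} \hat{\boldsymbol{\Sigma}}_1 & \mathbf{O} & \mathbf{O} \\ \mathbf{O} & \hat{\boldsymbol{\Sigma}}_1 & \mathbf{O} \\ \mathbf{O} & \mathbf{O} & \hat{\boldsymbol{\Sigma}}_1 \end{pmatrix}.$$

Thus

$$\begin{aligned}
[\mathbf{L}(\mathbf{X}'\hat{\boldsymbol{\Sigma}}^{-1}\mathbf{X})^{-1}\mathbf{L}']^{-1} &= \left[\frac{1}{2}\,\mathbf{L}\begin{pmatrix} \hat{\boldsymbol{\Sigma}}_1 & \mathbf{O} & \mathbf{O} \\ \mathbf{O} & \hat{\boldsymbol{\Sigma}}_1 & \mathbf{O} \\ \mathbf{O} & \mathbf{O} & \hat{\boldsymbol{\Sigma}}_1 \end{pmatrix}\mathbf{L}'\right]^{-1} \\
&= \frac{1}{3\hat{\sigma}^2}\begin{pmatrix} 2 & -1 \\ -1 & 2 \end{pmatrix} \\
&= \frac{3}{3\hat{\sigma}^2/\sigma^2}\cdot\frac{1}{3\sigma^2}\begin{pmatrix} 2 & -1 \\ -1 & 2 \end{pmatrix} \\
&= \frac{3}{w}\,\mathbf{Q},
\end{aligned}$$

where

$$w = \frac{3\hat{\sigma}^2}{\sigma^2} \qquad \text{and} \qquad \mathbf{Q} = \frac{1}{3\sigma^2}\begin{pmatrix} 2 & -1 \\ -1 & 2 \end{pmatrix}.$$

Also note that in this particular case, the EGLS estimator is equal to the ordinary
least-squares estimator for $\boldsymbol{\beta}$ since

$$\begin{aligned}
\hat{\boldsymbol{\beta}} &= (\mathbf{X}'\hat{\boldsymbol{\Sigma}}^{-1}\mathbf{X})^{-1}\mathbf{X}'\hat{\boldsymbol{\Sigma}}^{-1}\mathbf{y} \\
&= \frac{1}{2}\begin{pmatrix} \hat{\boldsymbol{\Sigma}}_1 & \mathbf{O} & \mathbf{O} \\ \mathbf{O} & \hat{\boldsymbol{\Sigma}}_1 & \mathbf{O} \\ \mathbf{O} & \mathbf{O} & \hat{\boldsymbol{\Sigma}}_1 \end{pmatrix}(\mathbf{I}_6,\ \mathbf{I}_6)\,\hat{\boldsymbol{\Sigma}}^{-1}\mathbf{y} \\
&= \frac{1}{2}(\mathbf{I}_6,\ \mathbf{I}_6)\mathbf{y} \\
&= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}.
\end{aligned}$$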




Hence

$$(L\hat\beta)'Q(L\hat\beta) = y'X(X'X)^{-1}L'QL(X'X)^{-1}X'y.$$

It can be shown that $X(X'X)^{-1}L'QL(X'X)^{-1}X'\Sigma$ is idempotent, and thus $(L\hat\beta)'Q(L\hat\beta)$ is distributed as a chi-square with 2 degrees of freedom. It can similarly be shown that $w$ is a chi-square with 3 degrees of freedom. Furthermore, $w$ and $(L\hat\beta)'Q(L\hat\beta)$ are independent chi-squares because of Theorem 5.6b. Thus we can test $H_0: L\beta = 0$ using the test statistic $(L\hat\beta)'[L(X'\hat\Sigma^{-1}X)^{-1}L']^{-1}(L\hat\beta)/2$ because its distribution is exactly an F distribution.
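The matrix identities used in Example 17.5 can be checked numerically. The following is a minimal sketch, assuming hypothetical values for the variance components (the names `sigma2` and `sigma1_2` and their values are illustrative, not from the text) and assuming the main-unit covariance block has the form $\sigma_1^2 J_2 + \sigma^2 I_2$.

```python
import numpy as np

# Hypothetical variance components, chosen only for illustration.
sigma2, sigma1_2 = 1.5, 0.7

# Covariance of the two subunit observations within one main unit (assumed form).
Sigma1 = sigma2 * np.eye(2) + sigma1_2 * np.ones((2, 2))
Sigma = np.kron(np.eye(6), Sigma1)        # six main units, 12 observations
X = np.vstack([np.eye(6), np.eye(6)])     # two replicates of the six cell means
L = np.array([[1, -1, -1, 1, 0, 0],
              [1, -1, 0, 0, -1, 1]], dtype=float)

Sigma_inv = np.linalg.inv(Sigma)
XtSiX_inv = np.linalg.inv(X.T @ Sigma_inv @ X)

# [L (X' Sigma^{-1} X)^{-1} L']^{-1} compared with (1 / (3 sigma^2)) [[2, -1], [-1, 2]].
lhs = np.linalg.inv(L @ XtSiX_inv @ L.T)
rhs = np.array([[2.0, -1.0], [-1.0, 2.0]]) / (3.0 * sigma2)

# Coefficient matrices of the GLS and OLS estimators for this design.
gls = XtSiX_inv @ X.T @ Sigma_inv
ols = np.linalg.inv(X.T @ X) @ X.T

print(np.allclose(lhs, rhs), np.allclose(gls, ols))
```

The main-unit variance component cancels out of the interaction contrasts, which is why only $\sigma^2$ appears in the result.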

If even one observation of this design is missing, exact small-sample inferences are not available for $L\beta$. Exact inferences are not available even when the design is balanced for estimable functions such as $c'\beta$ where $c' = (1 \;\; 0 \;\; 0 \;\; -1 \;\; 0 \;\; 0)$.


In most cases, approximate small-sample methods must be used. The exact distribution of

$$t = \frac{c'\hat\beta}{\sqrt{c'(X'\hat\Sigma^{-1}X)^{-}c}} \tag{17.15}$$

is unknown in general (McCulloch and Searle 2001, p. 167). However, a satisfactory small-sample test of $H_0: c'\beta = 0$ or confidence interval for $c'\beta$ is available by assuming that $t$ approximately follows a t distribution with unknown degrees of freedom $d$ (Giesbrecht and Burns 1985). To calculate $d$, we follow the premise of Satterthwaite (1941) to assume, analogously to Theorem 8.4aiii, that

$$\frac{d\,[c'(X'\hat\Sigma^{-1}X)^{-}c]}{c'(X'\Sigma^{-1}X)^{-}c} \tag{17.16}$$

approximately follows the central chi-square distribution. Equating the variance of the expression in (17.16),

$$\mathrm{var}\left\{\frac{d\,[c'(X'\hat\Sigma^{-1}X)^{-}c]}{c'(X'\Sigma^{-1}X)^{-}c}\right\} = \left[\frac{d}{c'(X'\Sigma^{-1}X)^{-}c}\right]^2 \mathrm{var}[c'(X'\hat\Sigma^{-1}X)^{-}c],$$

to the variance of a central chi-square distribution, $2d$ (Theorem 5.3a), we obtain the approximation

$$d \approx \frac{2\,[c'(X'\hat\Sigma^{-1}X)^{-}c]^2}{\mathrm{var}[c'(X'\hat\Sigma^{-1}X)^{-}c]}. \tag{17.17}$$



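This approximation cannot be used, of course, unless $\mathrm{var}[c'(X'\hat\Sigma^{-1}X)^{-}c]$ is known or can be estimated. We obtain an estimate of $\mathrm{var}[c'(X'\hat\Sigma^{-1}X)^{-}c]$ using the multivariate delta method (Lehmann 1999, p. 315). This method uses the first-order multivariate Taylor series (Harville 1997, p. 288) to approximate the variance of any scalar-valued function of a random vector, say, $f(\hat\theta)$. By this method $\mathrm{var}[f(\hat\theta)]$ is approximated as

$$\mathrm{var}[f(\hat\theta)] \approx \left(\left.\frac{\partial f(\theta)}{\partial\theta}\right|_{\theta=\hat\theta}\right)' \hat\Sigma_{\hat\theta} \left(\left.\frac{\partial f(\theta)}{\partial\theta}\right|_{\theta=\hat\theta}\right),$$

where

$$\left.\frac{\partial f(\theta)}{\partial\theta}\right|_{\theta=\hat\theta}$$

is the vector of partial derivatives of $f(\theta)$ with respect to $\theta$ evaluated at $\hat\theta$, and $\hat\Sigma_{\hat\theta}$ denotes an estimate of the covariance matrix of $\hat\theta$. In the case of inference for $c'\beta$ in the mixed linear model (17.4), let $\theta = \sigma$ and $f(\sigma) = c'(X'\Sigma^{-1}X)^{-}c$. Then

$$\left.\frac{\partial f(\sigma)}{\partial\sigma}\right|_{\sigma=\hat\sigma} = \begin{pmatrix} c'(X'\hat\Sigma^{-1}X)^{-}X'\hat\Sigma^{-1}Z_0Z_0'\hat\Sigma^{-1}X(X'\hat\Sigma^{-1}X)^{-}c \\ c'(X'\hat\Sigma^{-1}X)^{-}X'\hat\Sigma^{-1}Z_1Z_1'\hat\Sigma^{-1}X(X'\hat\Sigma^{-1}X)^{-}c \\ \vdots \\ c'(X'\hat\Sigma^{-1}X)^{-}X'\hat\Sigma^{-1}Z_mZ_m'\hat\Sigma^{-1}X(X'\hat\Sigma^{-1}X)^{-}c \end{pmatrix}.$$

Also $\hat\Sigma_{\hat\sigma}$, an estimate of the covariance matrix of $\hat\sigma$, can be obtained as the inverse of the negative Hessian [the matrix of second derivatives; see Harville (1997, p. 288)] of the restricted log-likelihood function (Theorem 17.4c) evaluated at $\hat\sigma$ (Pawitan 2001, pp. 226, 258).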
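The delta-method calculation of $d$ in (17.17) can be sketched in a few lines. This is an illustrative sketch only: the function name `satterthwaite_df` and its arguments are hypothetical, $X$ is assumed to have full column rank so that an ordinary inverse can stand in for the generalized inverse, and the covariance matrix of the variance-component estimates is assumed to be supplied by the caller.

```python
import numpy as np

def satterthwaite_df(X, Z_list, sigma, cov_sigma, c):
    """Approximate d in (17.17) by the delta method (hypothetical helper).

    Z_list[i] is Z_i (with Z_0 = I for the error term), sigma[i] is the
    estimate of the i-th variance component, and cov_sigma is an estimate
    of the covariance matrix of those estimates.
    """
    Sigma = sum(s * Z @ Z.T for s, Z in zip(sigma, Z_list))
    Sigma_inv = np.linalg.inv(Sigma)
    A = np.linalg.inv(X.T @ Sigma_inv @ X)
    f = c @ A @ c                       # c'(X' Sigma^{-1} X)^{-1} c
    u = Sigma_inv @ X @ A @ c           # Sigma^{-1} X (X' Sigma^{-1} X)^{-1} c
    # i-th partial derivative: u' Z_i Z_i' u
    grad = np.array([u @ Z @ Z.T @ u for Z in Z_list])
    var_f = grad @ cov_sigma @ grad     # delta-method variance of f
    return 2.0 * f ** 2 / var_f

# Check against the ordinary linear model, where d should equal n - k.
n, k, s2 = 10, 2, 3.0
X = np.column_stack([np.ones(n), np.arange(n, dtype=float)])
cov_s2 = np.array([[2.0 * s2 ** 2 / (n - k)]])   # var(s^2) under normality
d = satterthwaite_df(X, [np.eye(n)], [s2], cov_s2, np.array([0.0, 1.0]))
print(d)
```

In the ordinary linear model with $\hat\Sigma = s^2 I$, the approximation reproduces the usual $n - k$ error degrees of freedom, which is a useful sanity check on an implementation.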

We now generalize this idea to obtain the approximate small-sample distribution of

$$F = \frac{(L\hat\beta - L\beta)'[L(X'\hat\Sigma^{-1}X)^{-}L']^{-1}(L\hat\beta - L\beta)}{g} \tag{17.18}$$

in order to develop tests for $H_0: L\beta = t$ and joint confidence regions for $L\beta$. We obtain these inferences by assuming that the distribution of $F$ is approximately an F distribution with numerator degrees of freedom $g$ and unknown denominator degrees of freedom $\nu$ (Fai and Cornelius 1996). The method involves the spectral decomposition (see Theorem 2.12b) of $L(X'\hat\Sigma^{-1}X)^{-}L'$ to yield

$$P'[L(X'\hat\Sigma^{-1}X)^{-}L']P = D,$$

where $D = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_g)$ is the diagonal matrix of eigenvalues and $P = (p_1, p_2, \ldots, p_g)$ is the orthogonal matrix of normalized eigenvectors of $L(X'\hat\Sigma^{-1}X)^{-}L'$. Since $[L(X'\hat\Sigma^{-1}X)^{-}L']^{-1} = PD^{-1}P'$, $G = gF$ can be written as

$$G = \sum_{i=1}^{g} \frac{[p_i'L(\hat\beta - \beta)]^2}{\lambda_i} = \sum_{i=1}^{g} t_i^2, \tag{17.19}$$

where the $t_i$'s are approximately independent t-variables with respective degrees of freedom $\nu_i$.

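We compute the $\nu_i$ values by repeatedly applying equation (17.16). Then we find $\nu$ such that $F = g^{-1}G$ is distributed approximately as $F_{g,\nu}$. Since the square of a t-distributed random variable with $\nu_i$ degrees of freedom is an F-distributed random variable with 1 and $\nu_i$ degrees of freedom,

$$E(G) = \sum_{i=1}^{g} \frac{\nu_i}{\nu_i - 2} \quad [\text{by (5.34)}].$$

Now, since $E(F) = (1/g)E(G) = \nu/(\nu - 2)$,

$$\nu = \frac{2E(G)}{E(G) - g} = 2\left(\sum_{i=1}^{g}\frac{\nu_i}{\nu_i - 2}\right) \Bigg/ \left[\left(\sum_{i=1}^{g}\frac{\nu_i}{\nu_i - 2}\right) - g\right]. \tag{17.20}$$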


A method due to Kenward and Roger (1997) provides further improvements for 
small-sample inferences in mixed models. 


1. The method adjusts for two sources of bias in $L(X'\hat\Sigma^{-1}X)^{-}L'$ as an estimator of the covariance matrix of $L\hat\beta$ in small-sample situations, namely, that $L(X'\Sigma^{-1}X)^{-}L'$ does not account for the variability in $\hat\sigma$, and that $L(X'\hat\Sigma^{-1}X)^{-}L'$ is a biased estimator of $L(X'\Sigma^{-1}X)^{-}L'$. Kackar and Harville (1984) give an approximation to the first source of bias, and Kenward and Roger (1997) propose an adjustment for the second source of bias. Both adjustments are based on a Taylor series expansion around $\sigma$ (Kenward and Roger 1997, McCulloch and Searle 2001, pp. 164-167). The adjusted approximate covariance matrix of $L\hat\beta$ is

$$\hat\Sigma^*_{L\hat\beta} = L\left\{(X'\hat\Sigma^{-1}X)^{-} + 2(X'\hat\Sigma^{-1}X)^{-}\left[\sum_{i=0}^{m}\sum_{j=0}^{m} s_{ij}\big(Q_{ij} - P_i(X'\hat\Sigma^{-1}X)^{-}P_j\big)\right](X'\hat\Sigma^{-1}X)^{-}\right\}L',$$

where $s_{ij}$ is the $(i, j)$th element of $\hat\Sigma_{\hat\sigma}$,

$$Q_{ij} = X'\frac{\partial\hat\Sigma^{-1}}{\partial\sigma_i^2}\,\hat\Sigma\,\frac{\partial\hat\Sigma^{-1}}{\partial\sigma_j^2}X, \quad\text{and}\quad P_i = X'\frac{\partial\hat\Sigma^{-1}}{\partial\sigma_i^2}X.$$

2. Kenward and Roger (1997) assume that

$$F^* = \delta F_{KR} = \frac{\delta}{g}(L\hat\beta - L\beta)'\,\hat\Sigma^{*-1}_{L\hat\beta}\,(L\hat\beta - L\beta) \tag{17.22}$$

is approximately F-distributed with two (rather than one) adjustable constants, a scale factor $\delta$, and the denominator degrees of freedom $\nu$. They use a second-order Taylor series expansion (Harville 1997, p. 289) of $\hat\Sigma^{*-1}_{L\hat\beta}$ around $\sigma$ and conditional expectation relationships to yield $E(F_{KR})$ and $\mathrm{var}(F_{KR})$ approximately. After equating these to the mean (5.29) and variance of the F distribution to solve for $\delta$ and $\nu$, they obtain

$$\nu = 4 + \frac{g + 2}{g\gamma - 1}$$

and

$$\delta = \frac{\nu}{E(F_{KR})(\nu - 2)},$$

where

$$\gamma = \frac{\mathrm{var}(F_{KR})}{2E(F_{KR})^2}.$$
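The moment-matching step for the scale factor and denominator degrees of freedom can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `kenward_roger_constants` is hypothetical, and it assumes the moment-matching solution $\nu = 4 + (g + 2)/(g\gamma - 1)$ with $\gamma = \mathrm{var}(F_{KR})/[2E(F_{KR})^2]$ and $\delta = \nu/[E(F_{KR})(\nu - 2)]$, which follows from equating the supplied mean and variance to those of an $F_{g,\nu}$ distribution.

```python
def kenward_roger_constants(mean_F, var_F, g):
    """Solve for (delta, nu) so that delta * F_KR matches the first two
    moments of an F(g, nu) distribution (hypothetical helper)."""
    gamma = var_F / (2.0 * mean_F ** 2)
    if g * gamma <= 1.0:
        raise ValueError("moments are not compatible with a finite nu > 4")
    nu = 4.0 + (g + 2.0) / (g * gamma - 1.0)
    delta = nu / (mean_F * (nu - 2.0))
    return delta, nu

# Sanity check: feed in the exact moments of an F(2, 10) distribution.
g, nu0 = 2, 10.0
mean0 = nu0 / (nu0 - 2.0)
var0 = 2.0 * nu0 ** 2 * (g + nu0 - 2.0) / (g * (nu0 - 2.0) ** 2 * (nu0 - 4.0))
delta, nu = kenward_roger_constants(mean0, var0, g)
print(delta, nu)
```

When the supplied moments already belong to an exact F distribution, the scale factor should come back as 1 and the degrees of freedom should be recovered unchanged.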


These small-sample methods result in confidence coefficients and type I error rates 
closer to target values than do the large-sample methods. However, they involve 
many approximations, and it is therefore not surprising that simulation studies have 
shown that their statistical properties are not universally satisfactory (Schaalje et al. 
2002, Gomez et al. 2005, Keselman et al. 1999). 

Another approach to small-sample inferences in mixed linear models is the 
Bayesian approach (Chapter 11). Bayesian linear mixed models are not much 
harder to specify than Bayesian linear models, and Markov chain Monte Carlo 
methods can be used to draw samples from exact small-sample posterior distributions 
(Gilks et al. 1998, pp. 275-320). 


17.6 INFERENCE FOR THE $a_i$


A new kind of estimation problem sometimes arises for the linear mixed model in (17.2)

$$y = X\beta + \sum_{i=1}^{m} Z_i a_i + \varepsilon, \tag{17.23}$$

namely, the problem of estimation of realized values of the random components (the $a_i$'s) or linear functions of them. For simplicity, and without loss of generality, we rewrite (17.23) as

$$y = X\beta + Za + \varepsilon, \tag{17.24}$$

where $Z = (Z_1 \; Z_2 \; \ldots \; Z_m)$, $a = (a_1' \; a_2' \; \ldots \; a_m')'$, $\varepsilon$ is $N(0, \sigma^2 I_n)$, $a$ is $N(0, G)$ where

$$G = \begin{pmatrix} \sigma_1^2 I_{r_1} & O & \cdots & O \\ O & \sigma_2^2 I_{r_2} & \cdots & O \\ \vdots & \vdots & & \vdots \\ O & O & \cdots & \sigma_m^2 I_{r_m} \end{pmatrix},$$

and $\mathrm{cov}(\varepsilon, a) = O$. Then the problem can be expressed as that of estimating $a$ or a linear function $Ua$. To differentiate this problem from inference for an estimable function of $\beta$, the current problem is often referred to as prediction of a random effect.

Prediction of random effects dates back at least to the pioneering work of 
Henderson (1950) on prediction of the “value” of a genetic line of animals or 
plants, where the line is viewed as a random selection from a population of such 
lines. In education the specific effects of randomly chosen schools might be of inter- 
est, in medical research the effect of a randomly chosen clinic may be desired, and in 
agriculture the effect of a specific year on crop yields may be of interest. The phenom- 
enon of regression to the mean (Stigler 2000) for repeated measurements is closely 
related to prediction of random effects. 

The general problem is that of predicting $a$ for a given value of the observation vector $y$. Note that because of the model in (17.23), $a$ and $y$ are jointly multivariate normal, and

$$\begin{aligned} \mathrm{cov}(a, y) &= \mathrm{cov}(a, X\beta + Za + \varepsilon) \\ &= \mathrm{cov}(a, Za + \varepsilon) \\ &= \mathrm{cov}(a, Za) + \mathrm{cov}(a, \varepsilon) \quad \text{(see Problem 3.19)} \\ &= GZ' + O \\ &= GZ'. \end{aligned}$$

By extension of Theorem 10.6 to the case of a random vector $a$, the predictor based on $y$ that minimizes the mean squared error is $E(a \mid y)$. To be more precise, the vector function $t(y)$ that minimizes $E[a - t(y)]'[a - t(y)]$ is given by $t(y) = E(a \mid y)$. Since $a$ and $y$ are jointly multivariate normal, we have, by (4.26),

$$\begin{aligned} E(a \mid y) &= E(a) + \mathrm{cov}(a, y)[\mathrm{cov}(y)]^{-1}[y - E(y)] \\ &= 0 + GZ'\Sigma^{-1}(y - X\beta) \\ &= GZ'\Sigma^{-1}(y - X\beta). \end{aligned} \tag{17.25}$$




If $\beta$ and $\Sigma$ were known, this predictor would be a linear function of $y$. It is therefore sometimes called the best linear predictor (BLP) of $a$. More generally, the BLP of $Ua$ is

$$E(Ua \mid y) = UGZ'\Sigma^{-1}(y - X\beta). \tag{17.26}$$

Because the BLP is a linear function of $y$, the covariance matrix of $E(Ua \mid y)$ is

$$\mathrm{cov}[E(Ua \mid y)] = UGZ'\Sigma^{-1}ZGU'. \tag{17.27}$$

Replacing $\beta$ by $\hat\beta$ in (17.8), and replacing $G$ and $\Sigma$ by $\hat G$ and $\hat\Sigma$ (based on the REML estimates of the variance components), we obtain

$$\hat E(Ua \mid y) = U\hat G Z'\hat\Sigma^{-1}(y - X\hat\beta). \tag{17.28}$$

This predictor is neither unbiased nor a linear function of $y$. Nonetheless, it is an approximately unbiased estimate of a linear predictor, so it is often referred to as the estimated best linear unbiased predictor (EBLUP). Ignoring the randomness in $\hat G$ and $\hat\Sigma$, we obtain

$$\begin{aligned} \mathrm{cov}[\hat E(Ua \mid y)] &= \mathrm{cov}[UGZ'\Sigma^{-1}(y - X\hat\beta)] \\ &= \mathrm{cov}\{UGZ'\Sigma^{-1}[I - X(X'\Sigma^{-1}X)^{-}X'\Sigma^{-1}]y\} \\ &= UGZ'\Sigma^{-1}[I - X(X'\Sigma^{-1}X)^{-}X'\Sigma^{-1}]\,\Sigma\,[I - \Sigma^{-1}X(X'\Sigma^{-1}X)^{-}X']\Sigma^{-1}ZGU' \\ &= UGZ'\Sigma^{-1}[\Sigma - X(X'\Sigma^{-1}X)^{-}X']\Sigma^{-1}ZGU' \\ &= UGZ'[\Sigma^{-1} - \Sigma^{-1}X(X'\Sigma^{-1}X)^{-}X'\Sigma^{-1}]ZGU'. \end{aligned} \tag{17.29}$$


Small-sample improvements to (17.28) have been suggested by Kackar and Harville 
(1984), and approximate degrees of freedom for inferences based on EBLUPs have 
been investigated by Jeske and Harville (1988). 


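Example 17.6 (One-Way Random Effects). To illustrate EBLUP, we continue with the one-way random effects model of Examples 17.3d and 17.4 involving four containers randomly selected from each of three batches produced by a chemical plant. In terms of the linear mixed model in (17.23), we obtain

$$X = j_{12}, \quad \beta = \mu, \quad Z = \begin{pmatrix} j_4 & 0_4 & 0_4 \\ 0_4 & j_4 & 0_4 \\ 0_4 & 0_4 & j_4 \end{pmatrix}, \quad G = \sigma_1^2 I_3,$$

and

$$\Sigma = \sigma^2 I_{12} + \sigma_1^2 ZZ' = \begin{pmatrix} \sigma^2 I_4 + \sigma_1^2 J_4 & O_4 & O_4 \\ O_4 & \sigma^2 I_4 + \sigma_1^2 J_4 & O_4 \\ O_4 & O_4 & \sigma^2 I_4 + \sigma_1^2 J_4 \end{pmatrix}.$$

By (2.52) and (2.53),

$$\Sigma^{-1} = \frac{1}{\sigma^2}\begin{pmatrix} I_4 - \dfrac{\sigma_1^2}{\sigma^2 + 4\sigma_1^2}J_4 & O_4 & O_4 \\ O_4 & I_4 - \dfrac{\sigma_1^2}{\sigma^2 + 4\sigma_1^2}J_4 & O_4 \\ O_4 & O_4 & I_4 - \dfrac{\sigma_1^2}{\sigma^2 + 4\sigma_1^2}J_4 \end{pmatrix}.$$

To predict $a$, which in this case is the vector of random effects associated with the three batches, by (17.28) and using the REML estimates of the variance components, we obtain

$$\begin{aligned} \mathrm{EBLUP}(a) &= \hat G Z'\hat\Sigma^{-1}(y - X\hat\beta) = \hat\sigma_1^2 I_3 \begin{pmatrix} j_4' & 0_4' & 0_4' \\ 0_4' & j_4' & 0_4' \\ 0_4' & 0_4' & j_4' \end{pmatrix} \hat\Sigma^{-1}(y - \hat\mu j_{12}) \\ &= \frac{\hat\sigma_1^2}{\hat\sigma^2}\left(1 - \frac{4\hat\sigma_1^2}{\hat\sigma^2 + 4\hat\sigma_1^2}\right) \begin{pmatrix} j_4' & 0_4' & 0_4' \\ 0_4' & j_4' & 0_4' \\ 0_4' & 0_4' & j_4' \end{pmatrix} (y - \hat\mu j_{12}) \\ &= \frac{\hat\sigma_1^2}{\hat\sigma^2 + 4\hat\sigma_1^2} \begin{pmatrix} j_4' & 0_4' & 0_4' \\ 0_4' & j_4' & 0_4' \\ 0_4' & 0_4' & j_4' \end{pmatrix} (y - \hat\mu j_{12}) \\ &= \frac{\hat\sigma_1^2}{\hat\sigma^2 + 4\hat\sigma_1^2} \begin{pmatrix} y_{1.} - 4\hat\mu \\ y_{2.} - 4\hat\mu \\ y_{3.} - 4\hat\mu \end{pmatrix}. \end{aligned}$$

Thus

$$\mathrm{EBLUP}(a_i) = \frac{4\hat\sigma_1^2}{\hat\sigma^2 + 4\hat\sigma_1^2}\,(\bar y_{i.} - \bar y_{..}). \tag{17.30}$$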


If batch had been considered a fixed factor, and the one-way ANOVA model in (13.1) had been used with the constraint $\sum_i \alpha_i = 0$, we showed in (13.9) that

$$\hat\alpha_i = \bar y_{i.} - \bar y_{..}.$$

Thus $\mathrm{EBLUP}(a_i) = c\,\hat\alpha_i$ where $0 < c < 1$. For this reason, EBLUPs are sometimes referred to as shrinkage estimators.
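The shrinkage form of the EBLUP can be checked numerically against the general matrix expression $\hat G Z'\hat\Sigma^{-1}(y - X\hat\beta)$. The sketch below uses made-up data and made-up variance-component values standing in for REML estimates (none of these numbers come from the text).

```python
import numpy as np

# Hypothetical data: three batches, four containers each (not from the text).
y = np.array([74.0, 76, 75, 77, 68, 71, 72, 69, 75, 77, 77, 79])
sigma2_hat, sigma1_2_hat = 2.0, 9.0   # stand-ins for REML estimates (assumed)

X = np.ones((12, 1))
Z = np.kron(np.eye(3), np.ones((4, 1)))          # batch indicator columns
G = sigma1_2_hat * np.eye(3)
Sigma = sigma2_hat * np.eye(12) + Z @ G @ Z.T
Sigma_inv = np.linalg.inv(Sigma)

# EGLS estimate of mu and the general EBLUP expression.
beta_hat = np.linalg.solve(X.T @ Sigma_inv @ X, X.T @ Sigma_inv @ y)
eblup = G @ Z.T @ Sigma_inv @ (y - X @ beta_hat)

# Closed form: shrink the fixed-effects estimates (batch mean minus grand mean).
shrink = 4 * sigma1_2_hat / (sigma2_hat + 4 * sigma1_2_hat)
batch_means = y.reshape(3, 4).mean(axis=1)
closed_form = shrink * (batch_means - y.mean())

print(eblup, closed_form)
```

The factor `shrink` plays the role of $c$: it approaches 1 when the batch variance dominates and approaches 0 when the within-batch variance dominates.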




The approximate covariance matrix of the EBLUPs in (17.30) can be derived using (17.29), and confidence intervals can then be computed or hypothesis tests carried out.


An extensive development and discussion of EBLUPs is given by Searle et al. 
(1992, pp. 258-289). 


17.7 RESIDUAL DIAGNOSTICS 


The assumptions of the linear mixed model in (17.2) and (17.3) are independence, normality, and constant variance of the elements of each of the $a_i$ vectors, as well as independence, normality, and constant variance of the elements of $\varepsilon$. These assumptions are harder to check than for the standard linear model, and the usefulness of various types of residual plots for mixed model diagnosis is presently not fully understood (Brown and Prescott 1999, p. 77).

As a first step, we can examine each of the $\mathrm{EBLUP}(a_i)$ vectors as in (17.28) for normality, constant variance, and independence (see Section 9.1). This makes sense because, using (4.25) and assuming for simplicity that $\Sigma$ (and therefore $G$) are known, we have

$$\mathrm{cov}(a \mid y) = G - GZ'\Sigma^{-1}ZG.$$

Thus if $U = (O \; \ldots \; O \; I_{r_i} \; O \; \ldots \; O)$,

$$\begin{aligned} \mathrm{cov}(Ua \mid y) = \mathrm{cov}(a_i \mid y) &= UGU' - UGZ'\Sigma^{-1}ZGU' \\ &= \sigma_i^2 I_{r_i} - \sigma_i^4 Z_i'\Sigma^{-1}Z_i \\ &= \sigma_i^2 (I_{r_i} - \sigma_i^2 Z_i'\Sigma^{-1}Z_i). \end{aligned} \tag{17.31}$$

As was the case for the hat matrix in Section 9.1, the off-diagonal elements of the second term in (17.31) are often small in absolute value. Hence the elements of $\mathrm{EBLUP}(a_i)$ should display normality, constant variance, and approximate independence if the model assumptions are met. It turns out, however, that constant variance and normality of the $\mathrm{EBLUP}(a_i)$ vectors is a necessary rather than a sufficient condition for the model assumptions to hold. Simulation studies (Verbeke and Molenberghs 2000, pp. 83-87) have shown that EBLUPs tend to reflect the distributional assumptions of the model rather than the actual distribution of random effects in some situations.

The next step is to consider the assumptions of independence, normality, and constant variance for the elements of $\varepsilon$. The simple residual vector $y - X\hat\beta$ is seldom useful for this purpose because, assuming that $\Sigma$ is known, we have

$$\begin{aligned} \mathrm{cov}(y - X\hat\beta) &= \mathrm{cov}\{[I - X(X'\Sigma^{-1}X)^{-}X'\Sigma^{-1}]y\} \\ &= [I - X(X'\Sigma^{-1}X)^{-}X'\Sigma^{-1}]\,\Sigma\,[I - \Sigma^{-1}X(X'\Sigma^{-1}X)^{-}X'], \end{aligned}$$

which may not exhibit constant variance or independence. However, the vector $\hat\Sigma^{-1/2}(y - X\hat\beta)$, where $\hat\Sigma^{-1/2}$ is the inverse of the square root matrix of $\hat\Sigma$ (2.109), does have the desired properties.


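Theorem 17.7. Consider the model in which $y$ is $N_n(X\beta, \Sigma)$, where $\Sigma = \sigma^2 I + \sum_{i=1}^{m}\sigma_i^2 Z_iZ_i'$. Assume that $\Sigma$ is known, and let $\hat\beta = (X'\Sigma^{-1}X)^{-}X'\Sigma^{-1}y$. Then

$$\mathrm{cov}[\Sigma^{-1/2}(y - X\hat\beta)] = I - H_*, \tag{17.32}$$

where $H_* = \Sigma^{-1/2}X(X'\Sigma^{-1}X)^{-}X'\Sigma^{-1/2}$.

PROOF

$$\begin{aligned} \mathrm{cov}[\Sigma^{-1/2}(y - X\hat\beta)] &= \mathrm{cov}\{\Sigma^{-1/2}[I - X(X'\Sigma^{-1}X)^{-}X'\Sigma^{-1}]y\} \\ &= \Sigma^{-1/2}[I - X(X'\Sigma^{-1}X)^{-}X'\Sigma^{-1}]\,\Sigma\,[I - \Sigma^{-1}X(X'\Sigma^{-1}X)^{-}X']\Sigma^{-1/2} \\ &= \Sigma^{-1/2}\Sigma\Sigma^{-1/2} - \Sigma^{-1/2}X(X'\Sigma^{-1}X)^{-}X'\Sigma^{-1/2}. \end{aligned}$$

Now, since $\Sigma^{-1/2} = (CD^{1/2}C')^{-1}$ where $C$ is orthogonal as in Theorem 2.12d, and $D^{1/2}$ is a diagonal matrix as in (2.109), we obtain

$$\begin{aligned} \Sigma^{-1/2}\Sigma\Sigma^{-1/2} &= (CD^{1/2}C')^{-1}\,CDC'\,(CD^{1/2}C')^{-1} \\ &= CD^{-1/2}C'\,CDC'\,CD^{-1/2}C' \\ &= CD^{-1/2}DD^{-1/2}C' \\ &= CC' = I, \end{aligned}$$

and the result follows.

Thus the vector $\hat\Sigma^{-1/2}(y - X\hat\beta)$ can be examined for constant variance, normality, and approximate independence to verify the assumptions regarding $\varepsilon$.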
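The result in (17.32) is easy to confirm numerically for a small design. The following sketch assumes a known $\Sigma$ built from hypothetical variance components and a made-up full-rank design matrix; the square root inverse is formed from the spectral decomposition, mirroring the proof.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma2, sigma1_2 = 12, 1.0, 2.0            # hypothetical values
Z = np.kron(np.eye(3), np.ones((4, 1)))       # three groups of four observations
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Sigma = sigma2 * np.eye(n) + sigma1_2 * (Z @ Z.T)

# Sigma^{-1/2} = C D^{-1/2} C' from the spectral decomposition of Sigma.
eigvals, C = np.linalg.eigh(Sigma)
Sigma_inv_half = C @ np.diag(eigvals ** -0.5) @ C.T
Sigma_inv = np.linalg.inv(Sigma)

# M maps y to the GLS residual vector y - X beta_hat.
A = np.linalg.inv(X.T @ Sigma_inv @ X)
M = np.eye(n) - X @ A @ X.T @ Sigma_inv

# Covariance of the scaled residuals, and the matrix I - H_* it should equal.
cov_scaled = Sigma_inv_half @ M @ Sigma @ M.T @ Sigma_inv_half
H_star = Sigma_inv_half @ X @ A @ X.T @ Sigma_inv_half

print(np.allclose(cov_scaled, np.eye(n) - H_star))
```

As with the ordinary hat matrix, $H_*$ is symmetric and idempotent, so the scaled residuals are only approximately uncorrelated.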

A more common approach (Verbeke and Molenberghs 2000, p. 132; Brown and Prescott 1999, p. 77) to verifying the assumptions regarding $\varepsilon$ is to compute and examine $y - X\hat\beta - Z\hat a$. To see why this makes sense, assume that $\Sigma$ and $\beta$ are known. Then

$$\begin{aligned} \mathrm{cov}(y - X\beta - Za) &= \mathrm{cov}(y) - \mathrm{cov}(y, Za) - \mathrm{cov}(Za, y) + \mathrm{cov}(Za) \\ &= \Sigma - ZGZ' - ZGZ' + ZGZ' \\ &= \Sigma - ZGZ' \\ &= (ZGZ' + \sigma^2 I) - ZGZ' \\ &= \sigma^2 I. \end{aligned}$$


PROBLEMS


17.1 Consider the model $y = X\beta + \varepsilon$, where $\varepsilon$ is $N_n(0, \sigma^2 V)$, $V$ is a known positive definite $n \times n$ matrix, and $X$ is a known $n \times (k + 1)$ matrix of rank $k + 1$. Also assume that $C$ is a known $q \times (k + 1)$ matrix and $t$ is a known $q \times 1$ vector such that $C\beta = t$ is consistent. Let $\hat\beta = (X'V^{-1}X)^{-1}X'V^{-1}y$. Find the distribution of

$$F = \frac{(C\hat\beta - t)'[C(X'V^{-1}X)^{-1}C']^{-1}(C\hat\beta - t)/q}{y'[V^{-1} - V^{-1}X(X'V^{-1}X)^{-1}X'V^{-1}]y/(n - k - 1)}$$

(a) Assuming that $H_0: C\beta = t$ is false.
(b) Assuming that $H_0: C\beta = t$ is true.

(Hint: Consider the model for $P^{-1}y$, where $P$ is a nonsingular matrix such that $PP' = V$.)

17.2 For the model described in Problem 17.1, find a $100(1 - \alpha)\%$ confidence interval for $a'\beta$.

17.3 An exercise science experiment was conducted to investigate how ankle roll ($y$) is affected by the combination of four casting treatments (control, tape cast, air cast, and tape and brace) and two exercise levels (preexercise and postexercise). Each of the 16 subjects used in the experiment was assigned to each of the four casting treatments in random order. Five ankle roll measurements were made preexercise and five measurements were made postexercise for each casting treatment. Thus a total of 40 observations were obtained for each subject. This study can be regarded as a randomized block split-plot study with subsampling. A sensible model is

$$y_{ijkl} = \mu + \tau_i + \delta_j + \theta_{ij} + a_k + b_{ik} + c_{ijk} + \varepsilon_{ijkl},$$

where $i = 1, \ldots, 4$; $j = 1, 2$; $k = 1, \ldots, 16$; $l = 1, \ldots, 5$; $a_k$ is $N(0, \sigma_1^2)$; $b_{ik}$ is $N(0, \sigma_2^2)$; $c_{ijk}$ is $N(0, \sigma_3^2)$; $\varepsilon_{ijkl}$ is $N(0, \sigma^2)$, and all of the random effects are independent. If the data are sorted by subject, casting treatment, and exercise level, sketch out the $X$ and $Z_i$ matrices for the matrix form of this model as in (17.2).

17.4 (a) Consider the model $y = X\beta + \sum_{i=1}^{m} Z_i a_i + \varepsilon$, where $X$ is a known $n \times p$ matrix, the $Z_i$'s are known $n \times r_i$ full-rank matrices, $\beta$ is a $p \times 1$ vector of unknown parameters, $\varepsilon$ is an $n \times 1$ unknown random vector such that $E(\varepsilon) = 0$ and $\mathrm{cov}(\varepsilon) = R \neq \sigma^2 I_n$, and the $a_i$'s are $r_i \times 1$ unknown random vectors such that $E(a_i) = 0$ and $\mathrm{cov}(a_i) = G_i \neq \sigma_i^2 I_{r_i}$. As usual, $\mathrm{cov}(a_i, a_j) = O$ for $i \neq j$, where $O$ is $r_i \times r_j$, and $\mathrm{cov}(a_i, \varepsilon) = O$ for all $i$, where $O$ is $r_i \times n$. Find $\mathrm{cov}(y)$.


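(b) For the model in part (a), let $Z = (Z_1 \; Z_2 \; \ldots \; Z_m)$ and $a = (a_1' \; a_2' \; \ldots \; a_m')'$ so that the model can be written as $y = X\beta + Za + \varepsilon$ and

$$\mathrm{cov}(a) = G = \begin{pmatrix} G_1 & O & \cdots & O & O \\ O & G_2 & \cdots & O & O \\ \vdots & \vdots & & \vdots & \vdots \\ O & O & \cdots & G_{m-1} & O \\ O & O & \cdots & O & G_m \end{pmatrix}.$$

Express $\mathrm{cov}(y)$ in terms of $Z$, $G$, and $R$.

17.5 Consider the model in which $y$ is $N_n(X\beta, \Sigma)$, where $\Sigma = \sum_{i=0}^{m}\sigma_i^2 Z_iZ_i'$, and let $K$ be a full-rank matrix of appropriate dimensions as in Theorem 17.4c. Show that for any $i$,

$$E[y'K'(K\Sigma K')^{-1}KZ_iZ_i'K'(K\Sigma K')^{-1}Ky] = \mathrm{tr}[K'(K\Sigma K')^{-1}KZ_iZ_i'].$$

17.6 Show that the system of $m + 1$ equations generated by (17.6) can be written as $M\sigma = q$, where $\sigma = (\sigma_0^2 \; \sigma_1^2 \; \ldots \; \sigma_m^2)'$, $M$ is an $(m + 1) \times (m + 1)$ matrix with $ij$th element $\mathrm{tr}[K'(K\Sigma K')^{-1}KZ_iZ_i'K'(K\Sigma K')^{-1}KZ_jZ_j']$, and $q$ is an $(m + 1) \times 1$ vector with $i$th element $y'K'(K\Sigma K')^{-1}KZ_iZ_i'K'(K\Sigma K')^{-1}Ky$.

17.7 Consider the model in which $y$ is $N_n(X\beta, \Sigma)$, and let $L$ be a known full-rank $g \times p$ matrix whose rows define estimable functions of $\beta$.

(a) Show that $L(X'\Sigma^{-1}X)^{-}L'$ is nonsingular.

(b) Show that $(L\hat\beta - L\beta)'[L(X'\Sigma^{-1}X)^{-}L']^{-1}(L\hat\beta - L\beta)$ is $\chi^2(g)$.

17.8 For the model described in Problem 17.7, develop a $100(1 - \alpha)\%$ confidence interval for $E(y_0) = x_0'\beta$.

17.9 Refer to Example 17.5. Show that

$$X'\hat\Sigma^{-1}X = 2\begin{pmatrix} \hat\Sigma_1^{-1} & O & O \\ O & \hat\Sigma_1^{-1} & O \\ O & O & \hat\Sigma_1^{-1} \end{pmatrix}.$$

17.10 Refer to Example 17.5. Show that the solution to the REML estimating equations is given by

$$\hat\sigma^2 = \frac{1}{12}\, y' \begin{pmatrix} R & O & O & -R & O & O \\ O & R & O & O & -R & O \\ O & O & R & O & O & -R \\ -R & O & O & R & O & O \\ O & -R & O & O & R & O \\ O & O & -R & O & O & R \end{pmatrix} y, \quad\text{where } R = \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}.$$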


17.11 


17.12 


17.13 


17.14 


17.15 


17.16 


17.17 




Refer to Example 17.5. Show that X(X’X) 'L/QL(X’X) 'X’Z is 
idempotent. 


Refer to Example 17.5. Show that w and (LB Q(LB) are independent 
chi-square variables. 


Refer to Example 17.5. Let c’ =(1 0 0 -I1 O O), let w be as in 
(17.14), and let d=3. Show that if v is such that ww/d)= 
[c’(X’ > Oe then v is not distributed as a central chi-square random 
variable. 

To motivate Satterthwaite’s approximation in expression (17.16), consider 
the model in which y is N,(XB,%), where X is n x p of rank k, Y = oI 
and Y=s7l. If ce’ B is an estimable function, show _ that 
(n — k)[e'(X'S |X) e]/[e'(X’E |X) e], is distributed as y2(n — k). 

Given f(a) =[e(X’S'X)c], where o= (oj07---07) — and 


m 
m 
X = >) o/7Z;Zi, show that 
i=0 


(SX) XS D7 XS Xe 
roi 0-4) Ya, 9 pb. 0 Sale’ AV A es. <0, me. 97 


Of(o) _ 
do 


(XE IX XS ZZ! SIXES Xe 


Consider the model in which y is N,(XB,&), let L be a known full-rank 


g X p matrix whose rows define estimable functions of B, and let > be the 
REML estimate of }. As in (17.19), let D = diag(Ay, A2, ...,Am) be the 
diagonal matrix of eigenvalues and P = (p;, po, ..-,p,,) be the orthogonal 


matrix of normalized eigenvectors of [L(X’>'X)- Ly. 
(a) Show that (LB — LB) L(x’ 'X)- LL}! (Lp — LB) = 


& x 2 
> [PL B-Lp)] /A. 
i=1 


(b) Show that (p'LB)*/d; is of the form ¢'B/ (X'S xyre as in (17.14). 

(c) Show that cov(p'LB.p,LB) = 0 for i # 7. 

Consider the model in which y = XB + Za-+ e, where € is N(O, o ],) anda 

is N(O, G) as in (17.24). 

(a) Show that the linear function B(y—X)_ that minimizes 
Ela — By — X)]'[a— Bly — X)] is GZS '(y — XB). 

(b) Show that B=GZ'> '(y— Xp) also “minimizes” 
E[a — By — X)][a — B(y — X)]'. By “minimize,” we mean that any 
other choice for B adds a positive definite matrix to the result. 


17.18 Show that $[I - X(X'\Sigma^{-1}X)^{-}X'\Sigma^{-1}]\,\Sigma\,[I - X(X'\Sigma^{-1}X)^{-}X'\Sigma^{-1}]' = \Sigma - X(X'\Sigma^{-1}X)^{-}X'$ as in (17.29).

17.19 Consider the model described in Problem 17.17.

(a) Show that the best linear predictor of $Ua$ is $E(Ua \mid y) = UGZ'\Sigma^{-1}(y - X\beta)$.

(b) Show that $\mathrm{cov}[E(Ua \mid y)] = UGZ'\Sigma^{-1}ZGU'$.

(c) Given $\hat\beta = (X'\Sigma^{-1}X)^{-}X'\Sigma^{-1}y$, show that

$$\mathrm{cov}[UGZ'\Sigma^{-1}(y - X\hat\beta)] = UGZ'[\Sigma^{-1} - \Sigma^{-1}X(X'\Sigma^{-1}X)^{-}X'\Sigma^{-1}]ZGU'.$$

17.20 Consider the one-way random effects model of Example 17.6. Use (2.52) and (2.53) to derive the expression for $\Sigma^{-1}$.

17.21 Using (17.29), derive the covariance matrix for $\mathrm{EBLUP}(a)$ where $a_i$ is defined as in (17.30).

17.22 Consider the model described in Problem 17.17. Use (4.27) and assume that $\Sigma$ and $G$ are known to show that

$$\mathrm{cov}(a \mid y) = G - GZ'\Sigma^{-1}ZG.$$

17.23 Use the model of Example 17.3b (subsampling). Find the covariance matrix of the predicted batch effects using (17.31). Comment on the magnitudes of the off-diagonal elements of this matrix.

17.24 Use the model of Example 17.3b. Find the covariance matrix of the transformed residuals $\hat\Sigma^{-1/2}(y - X\hat\beta)$ using (17.32). Comment on the off-diagonal elements of this matrix.


18 Additional Models 


In this chapter we briefly discuss some models that are not linear in the parameters or 
that have an error structure different from that assumed in previous chapters. 


18.1 NONLINEAR REGRESSION 


A nonlinear regression model can be expressed as

$$y_i = f(x_i, \beta) + \varepsilon_i, \quad i = 1, 2, \ldots, n, \tag{18.1}$$

where $f(x_i, \beta)$ is a nonlinear function of the parameter vector $\beta$. The error term $\varepsilon_i$ is sometimes assumed to be distributed as $N(0, \sigma^2)$. An example of a nonlinear model is the exponential model

$$y_i = \beta_0 + \beta_1 e^{\beta_2 x_i} + \varepsilon_i.$$

Estimators of the parameters in (18.1) can be obtained using the method of least squares. We seek the value of $\hat\beta$ that minimizes

$$Q(\hat\beta) = \sum_{i=1}^{n}[y_i - f(x_i, \hat\beta)]^2. \tag{18.2}$$

A simple analytical solution for $\hat\beta$ that minimizes (18.2) is not available for nonlinear $f(x_i, \hat\beta)$. An iterative approach is therefore used to obtain a solution. In general, the resulting estimators in $\hat\beta$ are not unbiased, do not have minimum variance, and are not normally distributed. However, according to large-sample theory, the estimators are almost unbiased, have near-minimum variance, and are approximately normally distributed.


Linear Models in Statistics, Second Edition, by Alvin C. Rencher and G. Bruce Schaalje 
Copyright © 2008 John Wiley & Sons, Inc. 




Inferential procedures, including confidence intervals and hypothesis tests, are available for the least-squares estimator $\hat\beta$ obtained by minimizing (18.2). Diagnostic procedures are available for checking on the model and on the suitability of the large-sample inferential procedures.

For details of the above procedures, see Gallant (1975), Bates and Watts (1988), 
Seber and Wild (1989), Ratkowsky (1983, 1990), Kutner et al. (2005, Chapter 13), 
Hocking (1996, Section 11.2), Fox (1997, Section 14.2), and Ryan (1997, 
Chapter 13). 


18.2 LOGISTIC REGRESSION 


In some regression situations, the response variable $y$ has only two possible outcomes, for example, high blood pressure or low blood pressure, developing cancer of the esophagus or not developing it, whether a crime will be solved or not solved, and whether a bee specimen is a "killer" (africanized) bee or a domestic honey bee. In such cases, the outcome $y$ can be coded as 0 or 1 and we wish to predict the outcome (or the probability of the outcome) on the basis of one or more $x$'s.

To illustrate a linear model in which $y$ is binary, consider the model with one $x$:

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i; \quad y_i = 0, 1; \quad i = 1, 2, \ldots, n. \tag{18.3}$$

Since $y_i$ is 0 or 1, the mean $E(y_i)$ for each $x_i$ becomes the proportion of observations at $x_i$ for which $y_i = 1$. This can be expressed as

$$\begin{aligned} E(y_i) &= P(y_i = 1) = p_i, \\ 1 - E(y_i) &= P(y_i = 0) = 1 - p_i. \end{aligned} \tag{18.4}$$

The distribution $P(y_i = 0) = 1 - p_i$ and $P(y_i = 1) = p_i$ in (18.4) is known as the Bernoulli distribution. By (18.3) and (18.4), we have

$$E(y_i) = p_i = \beta_0 + \beta_1 x_i. \tag{18.5}$$

For the variance of $y_i$, we obtain

$$\mathrm{var}(y_i) = E[y_i - E(y_i)]^2 = p_i(1 - p_i). \tag{18.6}$$

By (18.5) and (18.6), we obtain

$$\mathrm{var}(y_i) = (\beta_0 + \beta_1 x_i)(1 - \beta_0 - \beta_1 x_i),$$




and the variance of each $y_i$ depends on the value of $x_i$. Thus the fundamental assumption of constant variance is violated, and the usual least-squares estimators $\hat\beta_0$ and $\hat\beta_1$ computed as in (6.5) and (6.6) will not be optimal (see Theorem 7.3d).

To obtain optimal estimators of $\beta_0$ and $\beta_1$, we could use generalized least-squares estimators

$$\hat\beta = (X'V^{-1}X)^{-1}X'V^{-1}y$$

as in Theorem 7.8a, but there is an additional challenge in fitting the linear model (18.5). Since $E(y_i) = p_i$ is a probability, it is limited by $0 \le p_i \le 1$. If we fit (18.5) by generalized least squares to obtain

$$\hat p_i = \hat\beta_0 + \hat\beta_1 x_i,$$

then $\hat p_i$ may be less than 0 or greater than 1 for some values of $x_i$. A model for $E(y_i)$ that is bounded between 0 and 1 and reaches 0 and 1 asymptotically (instead of linearly) would be more suitable. A popular choice is the logistic regression model

$$p_i = E(y_i) = \frac{e^{\beta_0 + \beta_1 x_i}}{1 + e^{\beta_0 + \beta_1 x_i}} = \frac{1}{1 + e^{-\beta_0 - \beta_1 x_i}}. \tag{18.7}$$

This model is illustrated in Figure 18.1. The model in (18.7) can be linearized by the simple transformation

$$\ln\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_i, \tag{18.8}$$

sometimes called the logit transformation.


Figure 18.1 Logistic regression function. [Plot of $E(Y) = \exp(-10 + 0.1X)/[1 + \exp(-10 + 0.1X)]$ against $X$ from 50 to 150; the vertical axis $E(Y)$ runs from 0 to 1.0.]




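The parameters $\beta_0$ and $\beta_1$ in (18.7) and (18.8) are typically estimated by the method of maximum likelihood (see Section 7.2). For a random sample $y_1, y_2, \ldots, y_n$ from the Bernoulli distribution with $P(y_i = 0) = 1 - p_i$ and $P(y_i = 1) = p_i$, the likelihood function becomes

$$L(\beta_0, \beta_1) = f(y_1, y_2, \ldots, y_n; \beta_0, \beta_1) = \prod_{i=1}^{n} f_i(y_i; \beta_0, \beta_1) = \prod_{i=1}^{n} p_i^{y_i}(1 - p_i)^{1 - y_i}. \tag{18.9}$$

Taking the logarithm of both sides of (18.9) and using (18.8), we obtain

$$\ln L(\beta_0, \beta_1) = \sum_{i=1}^{n} y_i(\beta_0 + \beta_1 x_i) - \sum_{i=1}^{n}\ln\left(1 + e^{\beta_0 + \beta_1 x_i}\right). \tag{18.10}$$

Differentiating (18.10) with respect to $\beta_0$ and $\beta_1$ and setting the results equal to zero gives

$$\sum_{i=1}^{n} y_i = \sum_{i=1}^{n}\frac{1}{1 + e^{-\hat\beta_0 - \hat\beta_1 x_i}}, \tag{18.11}$$

$$\sum_{i=1}^{n} x_i y_i = \sum_{i=1}^{n}\frac{x_i}{1 + e^{-\hat\beta_0 - \hat\beta_1 x_i}}. \tag{18.12}$$

These equations can be solved iteratively for $\hat\beta_0$ and $\hat\beta_1$.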
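One common iterative scheme for the likelihood equations (18.11) and (18.12) is Newton-Raphson. The sketch below is illustrative only: the text does not name a particular algorithm, the function name `fit_logistic` is hypothetical, and the data are made up. It uses only the standard library and solves the 2 x 2 Newton system by hand.

```python
import math

def fit_logistic(x, y, iterations=25):
    """Newton-Raphson for the one-predictor logistic model (illustrative)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        # Score vector: differences between the two sides of (18.11), (18.12).
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum(xi * (yi - pi) for xi, yi, pi in zip(x, y, p))
        # Information matrix, built from the weights p_i (1 - p_i).
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(xi * wi for xi, wi in zip(x, w))
        h11 = sum(xi * xi * wi for xi, wi in zip(x, w))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2 x 2 system explicitly.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Made-up data with overlapping outcomes, so a finite maximum exists.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [0, 0, 1, 0, 1, 0, 1, 1]
b0_hat, b1_hat = fit_logistic(x, y)
p_hat = [1.0 / (1.0 + math.exp(-(b0_hat + b1_hat * xi))) for xi in x]
print(b0_hat, b1_hat)
```

At convergence the fitted probabilities satisfy both likelihood equations: they sum to the number of successes, and their $x$-weighted sum matches $\sum x_i y_i$. If the outcomes were perfectly separated by $x$, the estimates would diverge and this simple loop would not be appropriate.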

The logistic regression model in (18.7) can be readily extended to include more than one $x$. Using the notation $\beta = (\beta_0, \beta_1, \ldots, \beta_k)'$ and $x_i = (1, x_{i1}, x_{i2}, \ldots, x_{ik})'$, the model in (18.7) becomes

$$p_i = E(y_i) = \frac{e^{x_i'\beta}}{1 + e^{x_i'\beta}} = \frac{1}{1 + e^{-x_i'\beta}},$$

and (18.8) takes the form

$$\ln\left(\frac{p_i}{1 - p_i}\right) = x_i'\beta, \tag{18.13}$$

where $x_i'\beta = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik}$. For binary $y_i$ ($y_i = 0, 1$; $i = 1, 2, \ldots, n$), the mean and variance are given by (18.4) and (18.6). The likelihood function and the value of $\beta$ that maximizes it are found in a manner analogous to the approach used to find $\hat\beta_0$ and $\hat\beta_1$. Confidence intervals, tests of significance, measures of fit, subset selection procedures, diagnostic techniques, and other procedures are available.

Logistic regression has been extended from binary to a polytomous logistic 
regression model in which y has several possible outcomes. These may be ordinal 
such as large, medium, and small, or categorical such as Republicans, Democrats, 
and Independents. The analysis differs for the ordinal and categorical cases. 

For details of these procedures, see Hosmer and Lemeshow (1989), Hosmer et al. (1989), McCullagh and Nelder (1989), Myers (1990, Section 7.4), Kleinbaum (1994), Stapleton (1995, Section 8.8), Stokes et al. (1995, Chapters 8 and 9), Kutner et al. (2005, Chapter 14), Hocking (1996, Section 11.4), Ryan (1997, Chapter 9), Fox (1997, Chapter 15), Christensen (1997), and McCulloch and Searle (2001, Chapter 5).


18.3 LOGLINEAR MODELS


In the analysis of categorical data, we often use loglinear models. To illustrate a loglinear model for categorical data, consider a two-way contingency table with frequencies (counts) designated as $y_{ij}$ as in Table 18.1, with $y_{i.} = \sum_{j=1}^{s} y_{ij}$ and $y_{.j} = \sum_{i=1}^{r} y_{ij}$. The corresponding cell probabilities $p_{ij}$ are given in Table 18.2, with $p_{i.} = \sum_{j=1}^{s} p_{ij}$ and $p_{.j} = \sum_{i=1}^{r} p_{ij}$.

The hypothesis that A and B are independent can be expressed as $H_0: p_{ij} = p_{i.}p_{.j}$ for all $i, j$. Under $H_0$, the expected frequencies are

$$E(y_{ij}) = n p_{i.} p_{.j}.$$

This becomes linear if we take the logarithm of both sides:

$$\ln E(y_{ij}) = \ln n + \ln p_{i.} + \ln p_{.j}.$$


TABLE 18.1 Contingency Table Showing Frequencies $y_{ij}$ (Cell Counts) for an $r \times s$ Classification of Two Categorical Variables A and B

Variable   $B_1$      $B_2$      ...   $B_s$      Total
$A_1$      $y_{11}$   $y_{12}$   ...   $y_{1s}$   $y_{1.}$
$A_2$      $y_{21}$   $y_{22}$   ...   $y_{2s}$   $y_{2.}$
...        ...        ...        ...   ...        ...
$A_r$      $y_{r1}$   $y_{r2}$   ...   $y_{rs}$   $y_{r.}$
Total      $y_{.1}$   $y_{.2}$   ...   $y_{.s}$   $y_{..} = n$




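TABLE 18.2 Cell Probabilities for an $r \times s$ Contingency Table

Variable   $B_1$      $B_2$      ...   $B_s$      Total
$A_1$      $p_{11}$   $p_{12}$   ...   $p_{1s}$   $p_{1.}$
$A_2$      $p_{21}$   $p_{22}$   ...   $p_{2s}$   $p_{2.}$
...        ...        ...        ...   ...        ...
$A_r$      $p_{r1}$   $p_{r2}$   ...   $p_{rs}$   $p_{r.}$
Total      $p_{.1}$   $p_{.2}$   ...   $p_{.s}$   $p_{..} = 1$


To test $H_0: p_{ij} = p_{i.}p_{.j}$, we can use the likelihood ratio test. The likelihood function is given by the multinomial density

$$L(p_{11}, p_{12}, \ldots, p_{rs}) = \frac{n!}{y_{11}!\,y_{12}!\cdots y_{rs}!}\, p_{11}^{y_{11}} p_{12}^{y_{12}} \cdots p_{rs}^{y_{rs}}.$$

The unrestricted maximum likelihood estimators of $p_{ij}$ (subject to $\sum_{ij} p_{ij} = 1$) are $\hat p_{ij} = y_{ij}/n$, and the estimators under $H_0$ are $\hat p_{ij} = y_{i.}y_{.j}/n^2$ (Christensen 1997, pp. 42-46). The likelihood ratio is then given by

$$LR = \prod_{i=1}^{r}\prod_{j=1}^{s}\left(\frac{y_{i.}y_{.j}}{n\,y_{ij}}\right)^{y_{ij}}.$$

The test statistic is

$$-2\ln LR = 2\sum_{i=1}^{r}\sum_{j=1}^{s} y_{ij}\ln\left(\frac{n\,y_{ij}}{y_{i.}y_{.j}}\right),$$

which is approximately distributed as $\chi^2[(r - 1)(s - 1)]$.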
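The likelihood ratio statistic for independence is simple to compute from a table of counts. The following is a minimal standard-library sketch; the function name `lr_independence_statistic` and the example tables are hypothetical, and cells with a zero count are taken to contribute zero (the limit of $y \ln y$ as $y \to 0$).

```python
import math

def lr_independence_statistic(table):
    """Return (-2 ln LR, degrees of freedom) for an r x s table of counts."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    stat = 0.0
    for i, row in enumerate(table):
        for j, y_ij in enumerate(row):
            if y_ij > 0:  # a zero cell contributes nothing to the sum
                stat += 2.0 * y_ij * math.log(
                    n * y_ij / (row_totals[i] * col_totals[j]))
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

# A table whose counts factor exactly into row and column totals.
stat_indep, df_indep = lr_independence_statistic([[10, 20], [20, 40]])
# A table with a clear association between rows and columns.
stat_assoc, df_assoc = lr_independence_statistic([[30, 10], [10, 30]])
print(stat_indep, stat_assoc)
```

The statistic would then be compared with an upper percentage point of the chi-square distribution with $(r - 1)(s - 1)$ degrees of freedom.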

For further details of loglinear models, see Ku and Kullback (1974), Bishop et al. 
(1975), Plackett (1981), Read and Cressie (1988), Santner and Duffy (1989), Agresti 
(1984, 1990) Dobson (1990, Chapter 9), Anderson (1991), and Christensen (1997). 


18.4 POISSON REGRESSION 


If the response $y_i$ in a regression model is a count, the Poisson regression model may be useful. The Poisson probability distribution is given by

$$f(y) = \frac{\mu^y e^{-\mu}}{y!}, \quad y = 0, 1, 2, \ldots.$$

The Poisson regression model is

$$y_i = E(y_i) + \varepsilon_i, \quad i = 1, 2, \ldots, n,$$

where the $y_i$'s are independently distributed as Poisson random variables and $\mu_i = E(y_i)$ is a function of $x_i'\beta = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik}$. Some commonly used functions of $x_i'\beta$ are

$$\mu_i = x_i'\beta, \quad \mu_i = e^{x_i'\beta}, \quad \mu_i = \ln(x_i'\beta). \tag{18.14}$$

In each of the three cases in (18.14), the values of $\mu_i$ must be positive.

To estimate $\beta$, we can use the method of maximum likelihood. Since $y_i$ has a Poisson distribution, the likelihood function is given by

$$L(\beta) = \prod_{i=1}^{n} f(y_i) = \prod_{i=1}^{n}\frac{\mu_i^{y_i}e^{-\mu_i}}{y_i!},$$

where $\mu_i$ is typically one of the three forms in (18.14). Iterative methods can be used to find the value of $\beta$ that maximizes $L(\beta)$. Confidence intervals, tests of hypotheses, measures of fit, and other procedures are available. For details, see Myers (1990, Section 7.5), Stokes et al. (1995, pp. 471-475), Lindsey (1997), and Kutner et al. (2005, Chapter 14).


18.5 GENERALIZED LINEAR MODELS 


Generalized linear models include the classical linear regression and ANOVA models 
covered in earlier chapters as well as logistic regression in Section 18.2 and some 
forms of nonlinear regression in Section 18.1. Also included in this broad family 
of models are loglinear models for categorical data in Section 18.3 and Poisson 
regression models for count data in Section 18.4. This expansion of traditional 
linear models was introduced by Wedderburn (1972). 

A generalized linear model can be briefly characterized by the following three 
components. 


1. Independent random variables $y_1, y_2, \ldots, y_n$ with expected value $E(y_i) = \mu_i$
and density function from the exponential family [described below in (18.15)].


2. A linear predictor


$$\mathbf{x}_i'\boldsymbol{\beta} = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik}.$$




3. A link function that describes how $E(y_i) = \mu_i$ relates to $\mathbf{x}_i'\boldsymbol{\beta}$:


$$g(\mu_i) = \mathbf{x}_i'\boldsymbol{\beta}.$$

The link function $g(\mu_i)$ is often nonlinear.


A density $f(y_i, \theta_i)$ belongs to the exponential family of density functions if
$f(y_i, \theta_i)$ can be expressed in the form


$$f(y_i, \theta_i) = \exp[y_i\theta_i + b(\theta_i) + c(y_i)]. \tag{18.15}$$


A scale parameter such as $\sigma^2$ in the normal distribution can be incorporated into
(18.15) by considering it to be known and treating it as part of $\theta_i$. Alternatively, an
additional parameter can be inserted into (18.15). The exponential family of
density functions provides a unified approach to estimation of the parameters in
generalized linear models.

Some common statistical distributions that are members of the exponential family 
are the binomial, Poisson, normal, and gamma [see (11.7)]. We illustrate three of 
these in Example 18.5. 


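Example 18.5. The binomial probability distribution can be written in the form of
(18.15) as follows:

$$\begin{aligned}
f(y_i, p_i) &= \binom{n_i}{y_i} p_i^{y_i}(1 - p_i)^{n_i - y_i} \\
&= \exp\left[y_i \ln p_i - y_i \ln(1 - p_i) + n_i \ln(1 - p_i) + \ln\binom{n_i}{y_i}\right] \\
&= \exp\left[y_i \ln\frac{p_i}{1 - p_i} + n_i \ln(1 - p_i) + \ln\binom{n_i}{y_i}\right] \\
&= \exp[y_i\theta_i + b(\theta_i) + c(y_i)],
\end{aligned} \tag{18.16}$$

where $\theta_i = \ln[p_i/(1 - p_i)]$, $b(\theta_i) = n_i \ln(1 - p_i) = -n_i \ln(1 + e^{\theta_i})$, and $c(y_i) = \ln\binom{n_i}{y_i}$.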
The Poisson distribution can be expressed in exponential form as follows:

$$f(y_i, \mu_i) = \frac{\mu_i^{y_i} e^{-\mu_i}}{y_i!} = \exp[y_i \ln\mu_i - \mu_i - \ln(y_i!)] = \exp[y_i\theta_i + b(\theta_i) + c(y_i)],$$

where $\theta_i = \ln\mu_i$, $b(\theta_i) = -\mu_i = -e^{\theta_i}$, and $c(y_i) = -\ln(y_i!)$.




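The normal distribution $N(\mu_i, \sigma^2)$ can be written in the form of (18.15) as
follows:

$$f(y_i, \mu_i) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(y_i - \mu_i)^2/2\sigma^2} = \exp[y_i\theta_i + b(\theta_i) + c(y_i)],$$

where $\theta_i = \mu_i/\sigma^2$, $b(\theta_i) = -\sigma^2\theta_i^2/2$, and $c(y_i) = -y_i^2/2\sigma^2 - \frac{1}{2}\ln(2\pi\sigma^2)$.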
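
The three decompositions in Example 18.5 can be checked numerically. The sketch below evaluates each density directly and again as $\exp[y\theta + b(\theta) + c(y)]$, taking $b(\theta) = -\sigma^2\theta^2/2$ for the normal case. The function names are hypothetical and the argument values in any call are arbitrary.

```python
import math

def exponential_family(y, theta, b, c):
    """Evaluate exp[y*theta + b(theta) + c(y)] as in (18.15)."""
    return math.exp(y * theta + b + c)

def binomial_both_ways(y, n, p):
    direct = math.comb(n, y) * p**y * (1 - p) ** (n - y)
    theta = math.log(p / (1 - p))
    b = -n * math.log(1 + math.exp(theta))
    c = math.log(math.comb(n, y))
    return direct, exponential_family(y, theta, b, c)

def poisson_both_ways(y, mu):
    direct = mu**y * math.exp(-mu) / math.factorial(y)
    theta = math.log(mu)
    b = -math.exp(theta)
    c = -math.log(math.factorial(y))
    return direct, exponential_family(y, theta, b, c)

def normal_both_ways(y, mu, sigma2):
    direct = math.exp(-(y - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)
    theta = mu / sigma2
    b = -sigma2 * theta**2 / 2
    c = -y**2 / (2 * sigma2) - 0.5 * math.log(2 * math.pi * sigma2)
    return direct, exponential_family(y, theta, b, c)
```

Each function returns a pair whose two entries should agree up to floating-point rounding, for example `binomial_both_ways(3, 10, 0.3)`.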


To obtain an estimator of $\boldsymbol{\beta}$ in a generalized linear model, we use the method of
maximum likelihood. From (18.15), the likelihood function is given by

$$L(\boldsymbol{\beta}) = \prod_{i=1}^n \exp[y_i\theta_i + b(\theta_i) + c(y_i)].$$

The logarithm of the likelihood is

$$\ln L(\boldsymbol{\beta}) = \sum_{i=1}^n y_i\theta_i + \sum_{i=1}^n b(\theta_i) + \sum_{i=1}^n c(y_i). \tag{18.17}$$


For the exponential family in (18.15), it can be shown that

$$E(y_i) = \mu_i = -b'(\theta_i),$$

where $b'(\theta_i)$ is the derivative with respect to $\theta_i$. This relates $\theta_i$ to the link function
$g(\mu_i) = \mathbf{x}_i'\boldsymbol{\beta}$.
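
The relation $E(y_i) = -b'(\theta_i)$ can be illustrated with a numerical derivative. The sketch below uses a central difference on $b(\theta) = -n\ln(1 + e^{\theta})$ (binomial) and $b(\theta) = -e^{\theta}$ (Poisson). The helper name `minus_b_prime` and the parameter values are assumptions made for the illustration.

```python
import math

def minus_b_prime(b, theta, h=1e-6):
    """Central-difference approximation to -b'(theta)."""
    return -(b(theta + h) - b(theta - h)) / (2 * h)

n_trials, p_success = 10, 0.3   # arbitrary binomial parameters
poisson_mu = 4.0                # arbitrary Poisson mean

binomial_mean = minus_b_prime(
    lambda t: -n_trials * math.log(1 + math.exp(t)),
    math.log(p_success / (1 - p_success)),
)
poisson_mean = minus_b_prime(lambda t: -math.exp(t), math.log(poisson_mu))
```

Up to the error of the finite difference, `binomial_mean` should match $n p$ and `poisson_mean` should match $\mu$.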


Differentiating (18.17) with respect to each $\beta_i$, setting the results equal to zero, and
solving the resulting (nonlinear) equations iteratively (iteratively reweighted least
squares) gives the estimators $\hat{\beta}_i$. Confidence intervals, tests of hypotheses, measures
of fit, subset selection techniques, and other procedures are available. For details, see
McCullagh and Nelder (1989), Dobson (1990), Myers (1990, Section 7.6), Hilbe
(1994), Lindsey (1997), Christensen (1997, Chapter 9), and McCulloch and Searle
(2001, Chapter 5).
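
A minimal sketch of iteratively reweighted least squares follows, assuming a Bernoulli response with the logit link and a single predictor. Each pass forms working weights $w_i = \mu_i(1 - \mu_i)$ and a working response $z_i = \eta_i + (y_i - \mu_i)/w_i$, then regresses $z$ on $x$ by weighted least squares. The function name `irls_logistic` and the data are hypothetical.

```python
import math

def irls_logistic(x, y, iterations=25):
    """IRLS for a Bernoulli response with logit link and one predictor."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        eta = [b0 + b1 * xi for xi in x]
        mu = [1.0 / (1.0 + math.exp(-e)) for e in eta]
        w = [m * (1.0 - m) for m in mu]  # working weights
        z = [e + (yi - m) / wi           # working response
             for e, yi, m, wi in zip(eta, y, mu, w)]
        # Weighted least squares of z on (1, x): solve the normal equations.
        sw = sum(w)
        swx = sum(wi * xi for wi, xi in zip(w, x))
        swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        swz = sum(wi * zi for wi, zi in zip(w, z))
        swxz = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
        det = sw * swxx - swx * swx
        b0 = (swxx * swz - swx * swxz) / det
        b1 = (sw * swxz - swx * swz) / det
    return b0, b1

# Hypothetical binary data, chosen so that the classes overlap.
x_bin = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y_bin = [0, 0, 1, 0, 1, 1]
a_hat, b_hat = irls_logistic(x_bin, y_bin)
p_hat = [1.0 / (1.0 + math.exp(-(a_hat + b_hat * xi))) for xi in x_bin]
```

The data are deliberately not perfectly separated. With separated classes the maximum likelihood estimate does not exist and the weights in this unguarded sketch would shrink toward zero.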



PROBLEMS 

18.1 For the Bernoulli distribution, $P(y_i = 0) = 1 - p_i$ and $P(y_i = 1) = p_i$ in
(18.4), show that $E(y_i) = p_i$ and $\mathrm{var}(y_i) = p_i(1 - p_i)$ as in (18.5) and (18.6).

18.2 Show that $\ln[p_i/(1 - p_i)] = \beta_0 + \beta_1 x_i$ in (18.8) can be obtained from (18.7).

18.3 Verify that $\ln L(\beta_0, \beta_1)$ has the form shown in (18.10), where $L(\beta_0, \beta_1)$ is as
given by (18.9).

18.4 Differentiate $\ln L(\beta_0, \beta_1)$ in (18.10) to obtain (18.11) and (18.12).

18.5 Show that $b(\theta_i) = -n_i\ln(1 + e^{\theta_i})$, as noted following (18.16).


APPENDIX A 
Answers and Hints to the Problems 


Chapter 2 


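2.1 Part (i) follows from the commutativity of real numbers, $a_{ij} + b_{ij} = b_{ij} + a_{ij}$.
For part (ii), let $\mathbf{C} = \mathbf{A} + \mathbf{B}$. Then, by (2.3), $\mathbf{C}' = (c_{ij})' = (c_{ji}) = (a_{ji} + b_{ji}) = (a_{ji}) + (b_{ji}) = \mathbf{A}' + \mathbf{B}'$.

2.2 (a) $\mathbf{A}' = \begin{pmatrix} 7 & 4 \\ -3 & 9 \\ 2 & 5 \end{pmatrix}$.

(b) $(\mathbf{A}')' = \begin{pmatrix} 7 & 4 \\ -3 & 9 \\ 2 & 5 \end{pmatrix}' = \begin{pmatrix} 7 & -3 & 2 \\ 4 & 9 & 5 \end{pmatrix} = \mathbf{A}$.

(c) $\mathbf{A}'\mathbf{A} = \begin{pmatrix} 65 & 15 & 34 \\ 15 & 90 & 39 \\ 34 & 39 & 29 \end{pmatrix}$, $\mathbf{A}\mathbf{A}' = \begin{pmatrix} 62 & 11 \\ 11 & 122 \end{pmatrix}$.

2.3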


10 2 -1 13 
w ape (' 2), mon (-! 8) 


(b) $|\mathbf{A}| = 10$, $|\mathbf{B}| = -7$, $|\mathbf{AB}| = -70 = (10)(-7)$.
(c) $|\mathbf{BA}| = -70 = |\mathbf{AB}|$.


pf WO 3 mitle: % 
(d) (AB) ( 5 a B/A ( ; a 
(e) tr(AB) =4, tr(BA) = 4. 


(f) For $\mathbf{AB}$, $\lambda_1 = 10.6023$, $\lambda_2 = -6.6023$. For $\mathbf{BA}$, $\lambda_1 = 10.6023$,
$\lambda_2 = -6.6023$.


Linear Models in Statistics, Second Edition, by Alvin C. Rencher and G. Bruce Schaalje 
Copyright © 2008 John Wiley & Sons, Inc. 






2.4 @ A+B=(1 ; a a-B=(~ 
Iv “3 3°26 

(b) A’= | 3 -7], B=|-2 9}. 
-4 2 ot 
11 4 11 

(c) (A+ BY = 2}, A+B=|1 2). 
9 1 9 


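2.5 The $(ij)$th element of $\mathbf{E} = \mathbf{B} + \mathbf{C}$ is $e_{ij} = b_{ij} + c_{ij}$. The $(ij)$th element
of $\mathbf{AE}$ is $\sum_k a_{ik}e_{kj} = \sum_k a_{ik}(b_{kj} + c_{kj}) = \sum_k (a_{ik}b_{kj} + a_{ik}c_{kj}) = \sum_k a_{ik}b_{kj} + \sum_k a_{ik}c_{kj}$, which is the $(ij)$th element of $\mathbf{AB} + \mathbf{AC}$.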


26 19 —29 
2.6 (a) ap=(*) a BA=|{ 10 44 0 |. 
56 -2 54 


eee 
13. 47 48 80 
0 8 phe oy 11): AB+O=(_% . 
8 0 
48 80 
aB+ac=(_% oe) 


@anr=(% 4). w= (3 41) 


(d) tr(AB) = 72, tr(BA) = 72. Soe 
(e) (@B)= (35 33), (aB)= (1 37), AB=(") 37]. 


(f) (ab = (7°), abs = (33). ap=(** a 
0 

2.7 (a) AB= | 0 
0 


1 
(b) x = any multiple of -') ; 
—1 


= 
| 
= 
Nn 
| 
Wo 
N——_” 


Ree 


(b) B+ C= 


ooo 
ooo 
eG 
I 
io) 


(c) rank(A) = 1, rank(B) = 1. 


2.8 (a) By (2.17), $\mathbf{a}'\mathbf{j} = a_1 \cdot 1 + a_2 \cdot 1 + \cdots + a_n \cdot 1 = \sum_{i=1}^n a_i$.


ay 
(b) If A= . |, then Aj = . — 


2.9 


2.10 


2.11 


2.12 


2.13 


2.14 


2.15 




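By (2.16), $(\mathbf{ABC})' = [(\mathbf{AB})\mathbf{C}]' = \mathbf{C}'(\mathbf{AB})' = \mathbf{C}'\mathbf{B}'\mathbf{A}'$.

(iii) $(\mathbf{A}'\mathbf{A})' = \mathbf{A}'(\mathbf{A}')' = \mathbf{A}'\mathbf{A}$.

(iv) The $i$th diagonal element of $\mathbf{A}'\mathbf{A}$ is $\mathbf{a}_i'\mathbf{a}_i$, where $\mathbf{a}_i$ is the $i$th column of $\mathbf{A}$.
Since $\mathbf{a}_i'\mathbf{a}_i = \sum_j a_{ji}^2 = 0$, we have $\mathbf{a}_i = \mathbf{0}$.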


24 «9 «(21 40 9 42 
DiA=( 4 —10 a aps=(_ fh 5 Ba 


a 2b 3c a ab 3ac 
DA=| 4a 5b 6c], DAD=| 4ab 5b% 6cb }. 
Ta 8b 9 Tac 8bce 9c? 


$\mathbf{y}'\mathbf{A}\mathbf{y} = a_{11}y_1^2 + a_{22}y_2^2 + a_{33}y_3^2 + 2a_{12}y_1y_2 + 2a_{13}y_1y_3 + 2a_{23}y_2y_3$.


26 9 6 12 
(a) Bx = | 20 }. (h) xy = {| -3 -2 -4}. 
19 6 4 8 


go> 1h 98 
(b) y'B = (40, — 16,29). @BB={-11 14 —21 }. 


28 —21 34 
6 15 
(c) x’/Ax = 108. (Gj) yz =| 4 10]. 
8 20 
a , {6 4 8 
(d) x’Cz= — 29. (k) zy’ = & 10 a 
(e) x’x = 14. ) JVy’y = V29. 
Ie rw { 14 -7 
(f) x’y = 15. mm) ce= ( ¥ ae) 


9 -—3 6 
(g) xx’ = | —3 1 -—2}. 
6 —2 4 
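(a) $\mathbf{x} + \mathbf{y} = \begin{pmatrix} 6 \\ 1 \\ 6 \end{pmatrix}$, $\mathbf{x} - \mathbf{y} = \begin{pmatrix} 0 \\ -3 \\ -2 \end{pmatrix}$.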


ll -3 6 
(b) tr(A) = 13, t(B)=12, A+Bil 6 2 24], tr(A+B)=25. 
5 -1 12 
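(c) $\mathbf{AB} = \begin{pmatrix} 29 & -20 & 30 \\ 5 & -3 & 7 \\ 46 & -25 & 44 \end{pmatrix}$, $\mathbf{BA} = \begin{pmatrix} 41 & -2 & 35 \\ 34 & -6 & 23 \\ 28 & 5 & 35 \end{pmatrix}$.

(d) $\mathrm{tr}(\mathbf{AB}) = 70$, $\mathrm{tr}(\mathbf{BA}) = 70$.
(e) $|\mathbf{AB}| = -403$, $|\mathbf{BA}| = -403$.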




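(f) $(\mathbf{AB})' = \begin{pmatrix} 29 & 5 & 46 \\ -20 & -3 & -25 \\ 30 & 7 & 44 \end{pmatrix}$, $\mathbf{B}'\mathbf{A}' = \begin{pmatrix} 29 & 5 & 46 \\ -20 & -3 & -25 \\ 30 & 7 & 44 \end{pmatrix}$.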


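2.16 $\mathbf{Bx} = 3\begin{pmatrix} 6 \\ 7 \\ 2 \end{pmatrix} - \begin{pmatrix} -2 \\ 1 \\ -3 \end{pmatrix} + 2\begin{pmatrix} 3 \\ 0 \\ 5 \end{pmatrix} = \begin{pmatrix} 18 \\ 21 \\ 6 \end{pmatrix} + \begin{pmatrix} 2 \\ -1 \\ 3 \end{pmatrix} + \begin{pmatrix} 6 \\ 0 \\ 10 \end{pmatrix} = \begin{pmatrix} 26 \\ 20 \\ 19 \end{pmatrix}$.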
27 «+16 27 16 
2.17 (a) (ABY =| -12 -6], B/A’=] -12 -6 ]. 
19 1 19 11 
, a= (7 a6 y= (7 aa 
(b) Reh) BIO A. I Be Oe 
(; ;) (; 6 (; —6 *) 
IB = = =B. 
0 1/\5 O 3 Caen 0 ee 


(c) |A| = 1. 


A. ifn oe 2B 
was =) 
Sipe, of BEN oO ke 
CoC ame (ay ee he 
a oe ae es Sy of BA 
(f) (A’) aC ) ae Ae (A ee oe 
2.18 (a) If C = AB, then by (2.35), we obtain 
Cy, = An By + Ai2Br 
@ ae 1 : (;) ps fH 
TNS: SING “Ea NG ( ) 
i: 3 anne 6 eG 9 A 
eee G0: Oy TN Sr Se 


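Continuing in this fashion, we obtain

$$\mathbf{AB} = \begin{pmatrix} 8 & 9 & 5 & 6 \\ 7 & 5 & 5 & 4 \\ 3 & 4 & 2 & 2 \end{pmatrix}.$$

(b) $\mathbf{AB} = \begin{pmatrix} 8 & 9 & 5 & 6 \\ 7 & 5 & 5 & 4 \\ 3 & 4 & 2 & 2 \end{pmatrix}$ when found in the usual way.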


2.19 


2.20 


2.21 


2.22 


2.23 


2.24 




Ds 28. OG 6 7 3 6 
(a) AB=abi+A.3Bo=|3 3 3 0]4+14 2 2 4 
et. Per i 


> —2 3 —7 
ww=2(7) +43) 9) = Ga) 
7 3 1 23 
—7 
Ab = ( 73 ) when found in the usual way. 


By (2.26), $\mathbf{AB} = (\mathbf{Ab}_1, \mathbf{Ab}_2, \ldots, \mathbf{Ab}_p)$. By (2.37) each $\mathbf{Ab}_i$ can be expressed
as a linear combination of the columns of $\mathbf{A}$, with coefficients from $\mathbf{b}_i$.


3 0 2 3 0 2 
22) Way (ese mee i eal Ue eo (a (eal ees men 
2 1 0 2 ef 0 
~6 0 2 9 0 2 
=e eee ak eee (ec ea Vs (em) es ema Des ee eee 
—4 3 0 =) 1 0 
a 
=| -4 -3 |] =AB. 
Sa -a, 


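Suppose $\mathbf{a}_i = \mathbf{0}$ in the set of vectors $\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_n$. Then $c_1\mathbf{a}_1 + \cdots + c_i\mathbf{0} + \cdots + c_n\mathbf{a}_n = \mathbf{0}$, where $c_1 = c_2 = \cdots = c_{i-1} = c_{i+1} = \cdots = c_n = 0$ and $c_i \neq 0$.
Hence, by (2.40), $\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_n$ are linearly dependent.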


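If one of the two matrices, say, $\mathbf{A}$, is nonsingular, multiply $\mathbf{AB} = \mathbf{O}$ by $\mathbf{A}^{-1}$ to
obtain $\mathbf{B} = \mathbf{O}$. Otherwise, they are both singular. In fact, as noted following
Example 2.3, the columns of $\mathbf{AB}$ are linear combinations of the columns of
$\mathbf{A}$, with coefficients from $\mathbf{b}_j$.

$$\mathbf{AB} = (b_{11}\mathbf{a}_1 + \cdots + b_{n1}\mathbf{a}_n,\; b_{12}\mathbf{a}_1 + \cdots + b_{n2}\mathbf{a}_n,\; \ldots) = (\mathbf{0}, \mathbf{0}, \ldots, \mathbf{0}).$$

Since a linear combination of the columns of $\mathbf{A}$ is $\mathbf{0}$, $\mathbf{A}$ is singular [see (2.40)].
Similarly, by a comment following (2.38), the rows of $\mathbf{AB}$ are linear combinations
of the rows of $\mathbf{B}$, and $\mathbf{B}$ is singular.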




2.25 


2.26 


2.28 


2.29 


2.30 


2.31 
2.32 
2.33 




AB= f 1) CB= ( i) rank(A)=2, rank(B)=2, rank(C)=2. 


a 8 5 = 2C11 +r C{3 Cit 212 A @ 
(a) AB= ( I i CB = & Wigan ee) Oe, ), C is not unique. 


é 1 2 6 
An example is C= (_| 1 ) 


ears NY fia Oy <2: male 
(b) 1 2 |= 0 gives two equations in three unknowns 


0 -il 
: 1 
with solution vector x; | —5 |, where x; is an arbitrary constant. We 
1 
can’t do the same for B because the columns of B are linearly independent. 
2s) D3 
(a) An example is B = 1 4 4 |. Although A and B can be non- 
—-1 -1 3 
singular, A — B must be singular so that (A — B)x = 0. 
—1 1 1 
(b) An example is C = 1 —4 1 |. In the expression Cx = 0, we 
2 1 —4 


have a linear combination of the columns of C that is equal to 0, which is 
the definition of linear dependence. Therefore, C must be singular. 


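$\mathbf{A}'$ is nonsingular by definition because its rows are the columns of $\mathbf{A}$. To
show that $(\mathbf{A}')^{-1} = (\mathbf{A}^{-1})'$, transpose both sides of $\mathbf{A}\mathbf{A}^{-1} = \mathbf{I}$ to obtain
$(\mathbf{A}\mathbf{A}^{-1})' = \mathbf{I}'$, $(\mathbf{A}^{-1})'\mathbf{A}' = \mathbf{I}$. Multiply both sides on the right by $(\mathbf{A}')^{-1}$.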


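$(\mathbf{AB})^{-1}$ exists by Theorem 2.4(i). Then

$$\begin{aligned}
\mathbf{AB}(\mathbf{AB})^{-1} &= \mathbf{I}, \\
\mathbf{A}^{-1}\mathbf{AB}(\mathbf{AB})^{-1} &= \mathbf{A}^{-1}\mathbf{I}, \\
\mathbf{B}^{-1}\mathbf{B}(\mathbf{AB})^{-1} &= \mathbf{B}^{-1}\mathbf{A}^{-1}.
\end{aligned}$$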
223. 1 Ge if D2 ae | 
ap= (75 ae B = (3 a CAB a0\ cag oat le 


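Multiply $\mathbf{A}$ by $\mathbf{A}^{-1}$ in (2.48) to get $\mathbf{I}$.
Multiply $\mathbf{A}$ by $\mathbf{A}^{-1}$ in (2.49) to get $\mathbf{I}$.
Multiply $\mathbf{B} + \mathbf{c}\mathbf{c}'$ by $(\mathbf{B} + \mathbf{c}\mathbf{c}')^{-1}$ in (2.50) to get $\mathbf{I}$.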




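2.34 Premultiply both sides of the equation by $\mathbf{A} + \mathbf{PBQ}$. The left side obviously
equals $\mathbf{I}$. The right side becomes

$$\begin{aligned}
(\mathbf{A} &+ \mathbf{PBQ})[\mathbf{A}^{-1} - \mathbf{A}^{-1}\mathbf{PB}(\mathbf{B} + \mathbf{BQA}^{-1}\mathbf{PB})^{-1}\mathbf{BQA}^{-1}] \\
&= \mathbf{A}\mathbf{A}^{-1} + \mathbf{PBQA}^{-1} - \mathbf{A}\mathbf{A}^{-1}\mathbf{PB}(\mathbf{B} + \mathbf{BQA}^{-1}\mathbf{PB})^{-1}\mathbf{BQA}^{-1} \\
&\quad - \mathbf{PBQA}^{-1}\mathbf{PB}(\mathbf{B} + \mathbf{BQA}^{-1}\mathbf{PB})^{-1}\mathbf{BQA}^{-1} \\
&= \mathbf{I} + \mathbf{P}[\mathbf{I} - \mathbf{B}(\mathbf{B} + \mathbf{BQA}^{-1}\mathbf{PB})^{-1} - \mathbf{BQA}^{-1}\mathbf{PB}(\mathbf{B} + \mathbf{BQA}^{-1}\mathbf{PB})^{-1}]\mathbf{BQA}^{-1} \\
&= \mathbf{I} + \mathbf{P}[\mathbf{I} - (\mathbf{B} + \mathbf{BQA}^{-1}\mathbf{PB})(\mathbf{B} + \mathbf{BQA}^{-1}\mathbf{PB})^{-1}]\mathbf{BQA}^{-1} \\
&= \mathbf{I} + \mathbf{P}[\mathbf{I} - \mathbf{I}]\mathbf{BQA}^{-1} = \mathbf{I}.
\end{aligned}$$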
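
The inverse identity behind Problem 2.34 can also be checked on a small example. The sketch below uses exact rational arithmetic on arbitrary $2 \times 2$ matrices, chosen only so that every inverse exists. All helper names and matrix values are assumptions made for the illustration.

```python
from fractions import Fraction

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def chain(*ms):
    """Multiply several matrices from left to right."""
    out = ms[0]
    for m in ms[1:]:
        out = matmul(out, m)
    return out

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def sub(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def inv2(m):
    """Inverse of a 2 x 2 matrix by the adjugate formula."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def frac(m):
    return [[Fraction(v) for v in row] for row in m]

A = frac([[2, 1], [1, 3]])
P = frac([[1, 0], [2, 1]])
B = frac([[1, 1], [0, 2]])
Q = frac([[1, 2], [0, 1]])

A_inv = inv2(A)
lhs = inv2(add(A, chain(P, B, Q)))              # (A + PBQ)^{-1}
inner = inv2(add(B, chain(B, Q, A_inv, P, B)))  # (B + BQA^{-1}PB)^{-1}
rhs = sub(A_inv, chain(A_inv, P, B, inner, B, Q, A_inv))
```

Because `Fraction` arithmetic is exact, `lhs` and `rhs` can be compared with `==` rather than with a tolerance.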


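2.35 Since $\mathbf{y}'\mathbf{A}'\mathbf{y}$ is a scalar and is therefore equal to its transpose, we have
$\mathbf{y}'\mathbf{A}'\mathbf{y} = (\mathbf{y}'\mathbf{A}'\mathbf{y})' = \mathbf{y}'(\mathbf{A}')'(\mathbf{y}')' = \mathbf{y}'\mathbf{A}\mathbf{y}$. Then $\frac{1}{2}\mathbf{y}'(\mathbf{A} + \mathbf{A}')\mathbf{y} = \frac{1}{2}\mathbf{y}'\mathbf{A}\mathbf{y} + \frac{1}{2}\mathbf{y}'\mathbf{A}'\mathbf{y} = \frac{1}{2}\mathbf{y}'\mathbf{A}\mathbf{y} + \frac{1}{2}\mathbf{y}'\mathbf{A}\mathbf{y}$.
2.36 Use the proof of part (i) of Theorem 2.6b, substituting $\geq 0$ for $> 0$.
2.37 Corollary 1: $\mathbf{y}'\mathbf{BAB}'\mathbf{y} = (\mathbf{B}'\mathbf{y})'\mathbf{A}(\mathbf{B}'\mathbf{y}) > 0$ if $\mathbf{B}'\mathbf{y} \neq \mathbf{0}$ since $\mathbf{A}$ is positive definite.
Then $\mathbf{B}'\mathbf{y} = y_1\mathbf{b}_1 + \cdots + y_k\mathbf{b}_k$, where $\mathbf{b}_i$ is the $i$th column of $\mathbf{B}'$; that is, $\mathbf{b}_i'$
is the $i$th row of $\mathbf{B}$. Since the rows of $\mathbf{B}$ are linearly independent, there is no
nonzero vector $\mathbf{y}$ such that $\mathbf{B}'\mathbf{y} = \mathbf{0}$.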


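2.38 We must show that if $\mathbf{A}$ is positive definite, then $\mathbf{A} = \mathbf{P}'\mathbf{P}$, where $\mathbf{P}$ is
nonsingular. By Theorems 2.12d and 2.12f, $\mathbf{A} = \mathbf{CDC}'$, where $\mathbf{C}$ is orthogonal
and $\mathbf{D} = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_p)$ with all $\lambda_i > 0$. Then $\mathbf{A} = \mathbf{CDC}' = \mathbf{CD}^{1/2}\mathbf{D}^{1/2}\mathbf{C}' = (\mathbf{D}^{1/2}\mathbf{C}')'(\mathbf{D}^{1/2}\mathbf{C}') = \mathbf{P}'\mathbf{P}$, where $\mathbf{D}^{1/2} = \mathrm{diag}(\sqrt{\lambda_1}, \sqrt{\lambda_2}, \ldots, \sqrt{\lambda_p})$. Show that $\mathbf{P} = \mathbf{D}^{1/2}\mathbf{C}'$ is nonsingular.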


2.39 This follows by Theorems 2.6c and 2.4(ii). 


2.40 (a) rank(A, c) = rank(A) = 3. Solution x; Z, XQ =, x3 2. 
(b) rank(A) = 2, rank(A, c) = 3. No solution. 
(c) rank(A, c) = rank(A) = 2. Solution x) = 7, x2 + x3 + x4 = 1. 


2.41 By definition, $\mathbf{A}\mathbf{A}^-\mathbf{A} = \mathbf{A}$. If $\mathbf{A}$ is $n \times m$, then for conformability of
multiplication, $\mathbf{A}^-$ must be $m \times n$.


it 40 
2.42 AA; | 0 


2 
1 0 
1 2 
2 2 = 0 -2 0 1 
ae an({ a al =-4(_§ = ii) 




2.44 Let C be the lower left 2x2 matrix C= e Then Co! = 


2 ON 1 ree ai 
(5 1)=(4 p)maer= (6 


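2.45 (i) By Theorem 2.4(i), $\mathrm{rank}(\mathbf{A}^-\mathbf{A}) \leq \mathrm{rank}(\mathbf{A})$ and $\mathrm{rank}(\mathbf{A}) = \mathrm{rank}(\mathbf{A}\mathbf{A}^-\mathbf{A}) \leq \mathrm{rank}(\mathbf{A}^-\mathbf{A})$. Hence $\mathrm{rank}(\mathbf{A}^-\mathbf{A}) = \mathrm{rank}(\mathbf{A})$.
(ii) $(\mathbf{A}\mathbf{A}^-\mathbf{A})' = \mathbf{A}'(\mathbf{A}^-)'\mathbf{A}'$.
(iii) Let $\mathbf{W} = \mathbf{A}[\mathbf{I} - (\mathbf{A}'\mathbf{A})^-\mathbf{A}'\mathbf{A}]$. Show that

$$\mathbf{W}'\mathbf{W} = [\mathbf{I} - (\mathbf{A}'\mathbf{A})^-\mathbf{A}'\mathbf{A}]'[\mathbf{A}'\mathbf{A} - \mathbf{A}'\mathbf{A}(\mathbf{A}'\mathbf{A})^-\mathbf{A}'\mathbf{A}] = [\mathbf{I} - (\mathbf{A}'\mathbf{A})^-\mathbf{A}'\mathbf{A}]'\mathbf{O} = \mathbf{O}.$$

Then by Theorem 2.2c(ii), $\mathbf{W} = \mathbf{O}$.

(iv) $\mathbf{A}[(\mathbf{A}'\mathbf{A})^-\mathbf{A}']\mathbf{A} = \mathbf{A}(\mathbf{A}'\mathbf{A})^-\mathbf{A}'\mathbf{A} = \mathbf{A}$, by part (iii).

(v) (Searle 1982, p. 222) To show that $\mathbf{A}(\mathbf{A}'\mathbf{A})^-\mathbf{A}'$ is invariant to the choice
of $(\mathbf{A}'\mathbf{A})^-$, let $\mathbf{B}$ and $\mathbf{C}$ be two values of $(\mathbf{A}'\mathbf{A})^-$. Then by part (iii), $\mathbf{A} = \mathbf{ABA}'\mathbf{A}$ and $\mathbf{A} = \mathbf{ACA}'\mathbf{A}$, so that $\mathbf{ABA}'\mathbf{A} = \mathbf{ACA}'\mathbf{A}$. To demonstrate
that this implies $\mathbf{ABA}' = \mathbf{ACA}'$, show that

$$(\mathbf{ABA}'\mathbf{A} - \mathbf{ACA}'\mathbf{A})(\mathbf{B}'\mathbf{A}' - \mathbf{C}'\mathbf{A}') = (\mathbf{ABA}' - \mathbf{ACA}')(\mathbf{ABA}' - \mathbf{ACA}')'.$$

The left side is $\mathbf{O}$ because $\mathbf{ABA}'\mathbf{A} = \mathbf{ACA}'\mathbf{A}$. The right side is then $\mathbf{O}$,
and by Theorem 2.2c(ii), $\mathbf{ABA}' - \mathbf{ACA}' = \mathbf{O}$. To show symmetry, let $\mathbf{S}$
be a symmetric generalized inverse of $\mathbf{A}'\mathbf{A}$ (see Problem 2.46). Then
$\mathbf{ASA}'$ is symmetric and $\mathbf{ASA}' = \mathbf{ABA}'$ since $\mathbf{ABA}'$ is invariant to
$(\mathbf{A}'\mathbf{A})^-$. Thus $\mathbf{ABA}'$ is also symmetric. To show that
$\mathrm{rank}[\mathbf{A}(\mathbf{A}'\mathbf{A})^-\mathbf{A}'] = r$, use parts (i) and (iv).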


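2.46 If $\mathbf{A}$ is symmetric and $\mathbf{B}$ is a generalized inverse of $\mathbf{A}$, show that $\mathbf{ABA} = \mathbf{AB}'\mathbf{A}$. Then show that $\frac{1}{2}(\mathbf{B} + \mathbf{B}')$ and $\mathbf{BAB}'$ are symmetric generalized
inverses of $\mathbf{A}$.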


2.47 (i) By Corollary 1 to Theorem 2.8b, we obtain 


0 0 0 
A ={0 4 0 
00} 


(ii) Using the five-step approach following Theorem 2.8b, with 
C= i a) defined as the upper right 2x2 matrix, we obtain 
0 
1 0 
0 


ee fo 8 oe 
CU=H1 1 +) and AT = | 0 
2 1 

2 


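2.48 (b) By definition, $\mathbf{A}\mathbf{A}^-\mathbf{A} = \mathbf{A}$. Multiplying on the left by $\mathbf{A}'$ gives
$\mathbf{A}'\mathbf{A}\mathbf{A}^-\mathbf{A} = \mathbf{A}'\mathbf{A}$. Show that $(\mathbf{A}'\mathbf{A})^{-1}$ exists and multiply on the left by it.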




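2.49 (iv) If $\mathbf{A}$ is positive definite, then by Theorem 2.6d, $\mathbf{A}$ can be expressed as
$\mathbf{A} = \mathbf{P}'\mathbf{P}$, where $\mathbf{P}$ is nonsingular. By Theorem 2.9c, we obtain

$$\begin{aligned}
|\mathbf{A}| = |\mathbf{P}'\mathbf{P}| &= |\mathbf{P}'||\mathbf{P}| && \text{[by (2.74)]} \\
&= |\mathbf{P}||\mathbf{P}| && \text{[by (2.63)]} \\
&= |\mathbf{P}|^2 > 0 && \text{[by (2.61)]}.
\end{aligned}$$

(vi) $|\mathbf{A}^{-1}\mathbf{A}| = |\mathbf{I}| = 1$,
$|\mathbf{A}^{-1}||\mathbf{A}| = 1$ [by (2.74)],
$|\mathbf{A}^{-1}| = 1/|\mathbf{A}|$.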
2.50 ial=| a= #0, note that A is nonsingular 
a a) 
aj=|, a1 al 
5 3 
Pee eae) any Pee Be nog 
et 2/7 [62 [20s 
1 3 
ae w(? 5) (2 ) Pe 501 09 
mb AB ON a Rag: aoe |G. 80) 
[2 5 
10 13 = 100(1) = 100 


(b) $|c\mathbf{A}| = |c\mathbf{I}\mathbf{A}| = |c\mathbf{I}||\mathbf{A}| = c^n|\mathbf{A}|$.


2.52 Corollary 4. Let Ay; = B, Ago = 1, Ao; = ce’, and Ay = ec. Then equate the 
right sides of (2.68) and (2.69). 


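2.53 $|\mathbf{AB}| = |\mathbf{A}||\mathbf{B}| = |\mathbf{B}||\mathbf{A}| = |\mathbf{BA}|$,

$|\mathbf{A}^2| = |\mathbf{AA}| = |\mathbf{A}||\mathbf{A}| = |\mathbf{A}|^2$.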


4 -—2 23 1 
Oy a Qf 9 2) BY ics 


-1 
2.55 Define B= ( Ai  ): Then 


I ATA 
BA = a ae co iF 
é Ax — Ao) Az] A 




2.56 


2.59 


2.60 




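By Corollary 1 to Theorem 2.9b, $|\mathbf{BA}| = |\mathbf{A}_{22} - \mathbf{A}_{21}\mathbf{A}_{11}^{-1}\mathbf{A}_{12}|$. By Theorem
2.9c, $|\mathbf{BA}| = |\mathbf{B}||\mathbf{A}|$. By Corollary 1 to Theorem 2.9b and (2.64),
$|\mathbf{B}| = |\mathbf{A}_{11}^{-1}| = 1/|\mathbf{A}_{11}|$.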


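We first show that since $\mathbf{c}_i'\mathbf{c}_j = 0$ for all $i \neq j$, the columns of $\mathbf{C}$ are linearly
independent. Suppose that there exist $a_1, a_2, \ldots, a_p$ such that $a_1\mathbf{c}_1 + a_2\mathbf{c}_2 + \cdots + a_p\mathbf{c}_p = \mathbf{0}$. Multiply by $\mathbf{c}_1'$ to obtain $a_1\mathbf{c}_1'\mathbf{c}_1 + a_2\mathbf{c}_1'\mathbf{c}_2 + \cdots + a_p\mathbf{c}_1'\mathbf{c}_p = \mathbf{c}_1'\mathbf{0} = 0$ or $a_1\mathbf{c}_1'\mathbf{c}_1 = 0$, which implies that $a_1 = 0$. In a similar
manner, we can show that $a_2 = a_3 = \cdots = a_p = 0$. Thus the columns of $\mathbf{C}$
are linearly independent and $\mathbf{C}$ is nonsingular. Multiply $\mathbf{C}'\mathbf{C} = \mathbf{I}$ on the
left by $\mathbf{C}$ and on the right by $\mathbf{C}^{-1}$.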


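(a) $\mathbf{C} = \begin{pmatrix} 1/\sqrt{3} & -1/\sqrt{2} & 1/\sqrt{6} \\ -1/\sqrt{3} & 0 & 2/\sqrt{6} \\ 1/\sqrt{3} & 1/\sqrt{2} & 1/\sqrt{6} \end{pmatrix}$.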


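(i) $|\mathbf{I}| = |\mathbf{C}'\mathbf{C}| = |\mathbf{C}'||\mathbf{C}| = |\mathbf{C}||\mathbf{C}| = |\mathbf{C}|^2$. Thus $|\mathbf{C}|^2 = 1$ and $|\mathbf{C}| = \pm 1$.
(ii) By (2.75), $|\mathbf{C}'\mathbf{AC}| = |\mathbf{ACC}'| = |\mathbf{AI}| = |\mathbf{A}|$.
(iii) Since $\mathbf{c}_i'\mathbf{c}_i = 1$ for all $i$, we have $\mathbf{c}_i'\mathbf{c}_i = \sum_j c_{ji}^2 = 1$, and the maximum
value of any $c_{ji}^2$ is 1.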


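(i) The $i$th diagonal element of $\mathbf{A} + \mathbf{B}$ is $a_{ii} + b_{ii}$. Hence $\mathrm{tr}(\mathbf{A} + \mathbf{B}) = \sum_i (a_{ii} + b_{ii}) = \sum_i a_{ii} + \sum_i b_{ii} = \mathrm{tr}(\mathbf{A}) + \mathrm{tr}(\mathbf{B})$.

(iv) By Theorem 2.2c(ii), the $i$th diagonal element of $\mathbf{AA}'$ is $\mathbf{a}_i'\mathbf{a}_i$, where $\mathbf{a}_i'$
is the $i$th row of $\mathbf{A}$.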


(v) By (ii), tr(A’A) = >, a/a; = >>, SS ap, where aj, = (aj1, di2, ..-, Gip). 
(vii) By (2.84), $\mathrm{tr}(\mathbf{C}'\mathbf{AC}) = \mathrm{tr}(\mathbf{CC}'\mathbf{A}) = \mathrm{tr}(\mathbf{IA}) = \mathrm{tr}(\mathbf{A})$.
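$\mathbf{B} = \begin{pmatrix} 2 & 1 \\ 0 & 2 \\ 1 & 0 \end{pmatrix}$, $\mathbf{B}'\mathbf{B} = \begin{pmatrix} 5 & 2 \\ 2 & 5 \end{pmatrix}$, $\mathbf{BB}' = \begin{pmatrix} 5 & 2 & 2 \\ 2 & 4 & 0 \\ 2 & 0 & 1 \end{pmatrix}$,

$\mathrm{tr}(\mathbf{B}'\mathbf{B}) = 5 + 5 = 10$, $\mathrm{tr}(\mathbf{BB}') = 5 + 4 + 1 = 10$.

(iii) Let $\mathbf{b}_i$ be the $i$th column of $\mathbf{B}$. Then

$$\sum_{i=1}^2 \mathbf{b}_i'\mathbf{b}_i = (2, 0, 1)\begin{pmatrix} 2 \\ 0 \\ 1 \end{pmatrix} + (1, 2, 0)\begin{pmatrix} 1 \\ 2 \\ 0 \end{pmatrix} = 5 + 5 = 10.$$

(iv) Let $\mathbf{b}_i'$ be the $i$th row of $\mathbf{B}$. Then

$$\sum_{i=1}^3 \mathbf{b}_i'\mathbf{b}_i = (2, 1)\begin{pmatrix} 2 \\ 1 \end{pmatrix} + (0, 2)\begin{pmatrix} 0 \\ 2 \end{pmatrix} + (1, 0)\begin{pmatrix} 1 \\ 0 \end{pmatrix} = 5 + 4 + 1 = 10.$$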




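2.61 $\mathbf{A} = \begin{pmatrix} 3 & 1 & 2 \\ 1 & 0 & -1 \end{pmatrix}$, $\mathbf{A}'\mathbf{A} = \begin{pmatrix} 10 & 3 & 5 \\ 3 & 1 & 2 \\ 5 & 2 & 5 \end{pmatrix}$, $\mathbf{AA}' = \begin{pmatrix} 14 & 1 \\ 1 & 2 \end{pmatrix}$,

$\mathrm{tr}(\mathbf{A}'\mathbf{A}) = 16$, $\mathrm{tr}(\mathbf{AA}') = 16$, $\sum_{ij} a_{ij}^2 = 3^2 + 1^2 + 2^2 + 1^2 + 0^2 + (-1)^2 = 16$.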


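2.62 $(\mathbf{A}^-\mathbf{A})^2 = \mathbf{A}^-\mathbf{A}\mathbf{A}^-\mathbf{A} = \mathbf{A}^-\mathbf{A}$ since $\mathbf{A}\mathbf{A}^-\mathbf{A} = \mathbf{A}$ by definition. Hence $\mathbf{A}^-\mathbf{A}$
is idempotent and $\mathrm{tr}(\mathbf{A}^-\mathbf{A}) = \mathrm{rank}(\mathbf{A}^-\mathbf{A}) = r = \mathrm{rank}(\mathbf{A})$ by Theorem
2.8c(i). Show that $\mathrm{tr}(\mathbf{A}\mathbf{A}^-) = r$ by a similar argument.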


2-9, 3 0 1.0 <0. a 
268 A=/]101],A=/0 -3 $],aA=]0 1 3], 
cae 0 0 0 00 0 
tr(A~A) = 2, 
Ot. 4 
AA~=]0 1. O |, tr(AA~) =2, rank(A~A) = rank(AA7~) = 2. 
0 0 
=e peace te. : 
=24, _ xX5> = > = : 
2.64 : nen ~1 4-2) \x 0 
—x, + 2x. = A 
—x, + 2x. = , 
y= .2%9, 


(2)- C2) eG) 
x2 = => = X2 2 
9) Xx2 1 
Use x2 = 1/ V5 to normalize x): 
2//2 
x2 = : 

1/v5 

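2.65 From $\mathbf{A}^2\mathbf{x} = \lambda^2\mathbf{x}$, we obtain $\mathbf{A}\mathbf{A}^2\mathbf{x} = \lambda^2\mathbf{A}\mathbf{x} = \lambda^2\lambda\mathbf{x} = \lambda^3\mathbf{x}$. By induction
$\mathbf{A}\mathbf{A}^{k-1}\mathbf{x} = \lambda^{k-1}\mathbf{A}\mathbf{x} = \lambda^{k-1}\lambda\mathbf{x} = \lambda^k\mathbf{x}$.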


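2.66 By (2.98) and (2.101), $\mathbf{A}^k = \mathbf{CD}^k\mathbf{C}'$, where $\mathbf{C}$ is an orthogonal matrix
containing the normalized eigenvectors of $\mathbf{A}$ and $\mathbf{D}^k = \mathrm{diag}(\lambda_1^k, \lambda_2^k, \ldots, \lambda_p^k)$.
If $-1 < \lambda_i < 1$ for all $i$, then $\mathbf{D}^k \to \mathbf{O}$ and $\mathbf{A}^k \to \mathbf{O}$.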
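
The limit in Problem 2.66 can be seen numerically. The sketch below raises a hypothetical symmetric $2 \times 2$ matrix, whose eigenvalues are approximately 0.62 and 0.18, to the 50th power by repeated multiplication. The matrix and the exponent are arbitrary choices for the illustration.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

# Symmetric matrix with both eigenvalues strictly between -1 and 1.
A_small = [[0.5, 0.2], [0.2, 0.3]]

power = [[1.0, 0.0], [0.0, 1.0]]  # start from the identity
for _ in range(50):
    power = matmul(power, A_small)

# Largest absolute entry of A^50; it shrinks roughly like 0.62**50.
largest_entry = max(abs(v) for row in power for v in row)
```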


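2.67 $(\mathbf{AB} - \lambda\mathbf{I})\mathbf{x} = \mathbf{0}$,
$(\mathbf{BAB} - \lambda\mathbf{B})\mathbf{x} = \mathbf{0}$,
$(\mathbf{BA} - \lambda\mathbf{I})\mathbf{Bx} = \mathbf{0}$.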




2.68 


2.69 


2.70 


2.72 


2.73 


2.74 


2.75 




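$$0 = |\mathbf{P}^{-1}\mathbf{AP} - \lambda\mathbf{I}| = |\mathbf{P}^{-1}\mathbf{AP} - \lambda\mathbf{P}^{-1}\mathbf{P}| = |\mathbf{P}^{-1}(\mathbf{A} - \lambda\mathbf{I})\mathbf{P}| = |(\mathbf{A} - \lambda\mathbf{I})\mathbf{P}^{-1}\mathbf{P}| = |\mathbf{A} - \lambda\mathbf{I}|.$$

Thus $\mathbf{P}^{-1}\mathbf{AP}$ and $\mathbf{A}$ have the same characteristic equation, as in (2.93).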


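Writing (2.92) for $\mathbf{x}_i$ and $\mathbf{x}_j$, we have $\mathbf{Ax}_i = \lambda_i\mathbf{x}_i$ and $\mathbf{Ax}_j = \lambda_j\mathbf{x}_j$. Multiplying
by $\mathbf{x}_j'$ and $\mathbf{x}_i'$ gives

$$\mathbf{x}_j'\mathbf{Ax}_i = \lambda_i\mathbf{x}_j'\mathbf{x}_i, \tag{1}$$
$$\mathbf{x}_i'\mathbf{Ax}_j = \lambda_j\mathbf{x}_i'\mathbf{x}_j. \tag{2}$$

Since $\mathbf{A}$ is symmetric, we can transpose (1) to obtain $(\mathbf{x}_j'\mathbf{Ax}_i)' = \lambda_i(\mathbf{x}_j'\mathbf{x}_i)'$ or
$\mathbf{x}_i'\mathbf{Ax}_j = \lambda_i\mathbf{x}_i'\mathbf{x}_j$. This has the same left side as (2), and thus $\lambda_i\mathbf{x}_i'\mathbf{x}_j = \lambda_j\mathbf{x}_i'\mathbf{x}_j$ or
$(\lambda_i - \lambda_j)\mathbf{x}_i'\mathbf{x}_j = 0$. Since $\lambda_i - \lambda_j \neq 0$, we have $\mathbf{x}_i'\mathbf{x}_j = 0$.

By (2.101), $\mathbf{A} = \mathbf{CDC}'$. Since $\mathbf{C}$ is orthogonal, we multiply on the left by $\mathbf{C}'$
and on the right by $\mathbf{C}$ to obtain $\mathbf{C}'\mathbf{AC} = \mathbf{C}'\mathbf{CDC}'\mathbf{C} = \mathbf{D}$.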


$\mathbf{C} = \begin{pmatrix} -.5774 & .8165 & 0 \\ .5774 & .4082 & -.7071 \\ .5774 & .4082 & .7071 \end{pmatrix}$

(i) By Theorem 2.12d, $|\mathbf{A}| = |\mathbf{CDC}'|$. By (2.75), $|\mathbf{CDC}'| = |\mathbf{C}'\mathbf{CD}| = |\mathbf{D}|$.
By (2.59), $|\mathbf{D}| = \prod_{i=1}^n \lambda_i$.


(a) Eigenvalues of A: 1, 2, —1 


8018 3015 .7071 
Eigenvectors: x} = | .5345 }, x2= | 9045 ], w%=] 0 
.2673 3015 .7071 


(b) $\mathrm{tr}(\mathbf{A}) = 1 + 2 - 1 = 2$, $|\mathbf{A}| = (1)(2)(-1) = -2$.


In the proof of part (i), if $\mathbf{A}$ is positive semidefinite, $\mathbf{x}_i'\mathbf{Ax}_i \geq 0$, while
$\mathbf{x}_i'\mathbf{x}_i > 0$. By Corollary 1 to Theorem 2.12d, $\mathbf{C}'\mathbf{AC} = \mathbf{D}$, where $\mathbf{D} = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$. Since $\mathbf{C}$ is orthogonal and nonsingular, then by
Theorem 2.4(ii), the rank of $\mathbf{D}$ is the same as the rank of $\mathbf{A}$. Since $\mathbf{D}$ is
diagonal, the rank is the number of nonzero elements on the diagonal, that is, the
number of nonzero eigenvalues.


(a) |A| =1. 


(b) The eigenvalues of A are .2679, 1, and 3.7321, all of which are positive. 




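2.76 (a) $(\mathbf{A}^{1/2})' = (\mathbf{CD}^{1/2}\mathbf{C}')' = (\mathbf{C}')'(\mathbf{D}^{1/2})'\mathbf{C}' = \mathbf{CD}^{1/2}\mathbf{C}' = \mathbf{A}^{1/2}$.
(b) $(\mathbf{A}^{1/2})^2 = \mathbf{A}^{1/2}\mathbf{A}^{1/2} = \mathbf{CD}^{1/2}\mathbf{C}'\mathbf{CD}^{1/2}\mathbf{C}' = \mathbf{CD}^{1/2}\mathbf{D}^{1/2}\mathbf{C}' = \mathbf{CDC}' = \mathbf{A}$.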


2 eS eS (“ee em = Gs 


arceore=(Z)(1 (9 1) 


_1/1+v3 1-¥v3 
6 2\1- v3 14+V3)) 


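2.78 (i) $(\mathbf{I} - \mathbf{A})^2 = \mathbf{I} - 2\mathbf{A} + \mathbf{A}^2 = \mathbf{I} - 2\mathbf{A} + \mathbf{A} = \mathbf{I} - \mathbf{A}$.
(ii) $\mathbf{A}(\mathbf{I} - \mathbf{A}) = \mathbf{A} - \mathbf{A}^2 = \mathbf{A} - \mathbf{A} = \mathbf{O}$.
(iii) $(\mathbf{P}^{-1}\mathbf{AP})^2 = \mathbf{P}^{-1}\mathbf{APP}^{-1}\mathbf{AP} = \mathbf{P}^{-1}\mathbf{A}^2\mathbf{P} = \mathbf{P}^{-1}\mathbf{AP}$.
(iv) $(\mathbf{C}'\mathbf{AC})^2 = \mathbf{C}'\mathbf{ACC}'\mathbf{AC} = \mathbf{C}'\mathbf{A}^2\mathbf{C} = \mathbf{C}'\mathbf{AC}$,
$(\mathbf{C}'\mathbf{AC})' = \mathbf{C}'\mathbf{A}'(\mathbf{C}')' = \mathbf{C}'\mathbf{AC}$ if $\mathbf{A} = \mathbf{A}'$.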
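
Parts (i) to (iii) of Problem 2.78 can be confirmed on a concrete idempotent matrix. The sketch below uses exact fractions. The matrix $\mathbf{A}$ (every entry $\frac{1}{2}$) and the nonsingular $\mathbf{P}$ are hypothetical choices, not taken from the text.

```python
from fractions import Fraction

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def sub(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

half = Fraction(1, 2)
I2 = [[Fraction(1), Fraction(0)], [Fraction(0), Fraction(1)]]
A_idem = [[half, half], [half, half]]  # idempotent: A*A equals A
P_mat = [[Fraction(1), Fraction(1)], [Fraction(0), Fraction(1)]]
P_inv = [[Fraction(1), Fraction(-1)], [Fraction(0), Fraction(1)]]

I_minus_A = sub(I2, A_idem)                        # part (i)
product = matmul(A_idem, I_minus_A)                # part (ii)
similar = matmul(matmul(P_inv, A_idem), P_mat)     # part (iii)
zero2 = [[Fraction(0), Fraction(0)], [Fraction(0), Fraction(0)]]
```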


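2.79 $(\mathbf{A}^-\mathbf{A})^2 = \mathbf{A}^-\mathbf{A}\mathbf{A}^-\mathbf{A} = \mathbf{A}^-\mathbf{A}$, since $\mathbf{A}\mathbf{A}^-\mathbf{A} = \mathbf{A}$.
$[\mathbf{A}(\mathbf{A}'\mathbf{A})^-\mathbf{A}']^2 = \mathbf{A}(\mathbf{A}'\mathbf{A})^-\mathbf{A}'\mathbf{A}(\mathbf{A}'\mathbf{A})^-\mathbf{A}' = \mathbf{A}(\mathbf{A}'\mathbf{A})^-\mathbf{A}'$, since $\mathbf{A} = \mathbf{A}(\mathbf{A}'\mathbf{A})^-\mathbf{A}'\mathbf{A}$ by Theorem 2.8c(iii).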

2.80 (a) 2, (e) 2, (f) 1, 1, 0. 


2.81 By (2.107), tr(A) = S72? Ai. oe? of Section 2.12.2 and (2.107), tr(A”) = 
P_, Ay. Then [tr(A)]” = 7?) AF +2 oy 4g; AiAj = t(A7)t 20,2; Aidj. 


OH  OB(BAB’)'B 


2.82 x a : 
1\-1 
ie O(BAB’) B, 
Ox 
B’ 
= B'(BAB’)~ oe (BAB’)'B, 


a 
= —B(BAB’) ‘B= B(BAB)) 'B, 


WA 
ee oe 


2.83 Let X= (5 . such that ac > b*. Then, 


OIn|X|__ AIn(ac — b*) 
OX = = OX ° 




Oln(ac—b*) A In(ac—b?) 
Oa Ob 


OIn(ac—b?) A In(ac—b*) ? 
Ob Oc 


1 c —2b 
-—3(_ ) 

2 c —2b 1 c 0 
-—3(_4 -)-a—a(4 me 
= 2X"! — diagx"!. 


1 0 
2.84 The constraints can be expressed as h(x) = Cx—t whereC = | 1 1 ] and 
0 1 


t= @ . The Lagrange equations are 2Ax + C’A = 0 and Cx = t, or 
2A C x\ /0 
C oO AJ Atty 
The solution to this system of equations is 
x\_ (2A C\'/0 
A} \C O t} 
Substituting and simplifying using (2.50) we obtain


1/6 


x= a and a= (12). 


Chapter 3 


3.1 By (3.3) we have 


yf) dy = aE(y). 


Boy = | Bio = a| 


3.2 $E(y - \mu)^2 = E(y^2 - 2\mu y + \mu^2)$
$= E(y^2) - 2\mu E(y) + \mu^2$ [by (3.4) and (3.5)]
$= E(y^2) - 2\mu^2 + \mu^2 = E(y^2) - \mu^2$.
3.3 $\mathrm{var}(ay) = E(ay - a\mu)^2$ [by (3.6)]
$= E[a(y - \mu)]^2 = E[a^2(y - \mu)^2]$
$= a^2E(y - \mu)^2$ [by (3.4)].




3.4 The solution is similar to the answer to Problem 3.2. 


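3.5 $$\begin{aligned}
E(y_iy_j) &= \int\!\!\int y_iy_j f(y_i, y_j)\,dy_i\,dy_j \\
&= \int\!\!\int y_iy_j f_i(y_i)f_j(y_j)\,dy_i\,dy_j \qquad \text{[by (3.12)]} \\
&= \int y_jf_j(y_j)\left[\int y_if_i(y_i)\,dy_i\right]dy_j \\
&= E(y_i)\int y_jf_j(y_j)\,dy_j = E(y_i)E(y_j).
\end{aligned}$$

3.6 $\mathrm{cov}(y_i, y_j) = E(y_iy_j) - \mu_i\mu_j$ [by (3.11)]

$= \mu_i\mu_j - \mu_i\mu_j$ [by (3.14)].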
3.7 (a) Using the quadratic formula to solve for $x$ in $y = 1 + 2x - x^2$ and
$y = 2x - x^2$, we obtain $x = 1 \pm \sqrt{2 - y}$ and $x = 1 \pm \sqrt{1 - y}$, respectively,
which become the limits of integration in (3.16) and (3.17).


3.8 (a) Area ={? fay ax + if. es dy dx = 2. 


(b) fi = | te Ayes. 
1 


a 
@ 
5 
re) 
@ 
= 
i 
a 
— 
II 
NI 
= 
\ 
eS 
A 
we 


3 
E(x) = | x( 5 dx = 2, 
1 


1 2; 
EG) = [ sory + | W2—ydy =$42=1, 


2 px 3 pdx 
Buy) =| | sy Jody ae + | | xy (4)dy dx = 2, 
1 Jx-1 2 J3-x 


Oy =2—2%A1)=0. 




_f@y)_ 2 
fil) — 4 


(c) f (yx) =a, 


Bui = | y( Idy = x — 5, 1<x<2, 
1 


4—x 
Bob) = | yW(1)dy = 3 — x, 2<x<3. 
3-—x 


x+y, E(x +y1) 
Xo + x E(x2 + y2) 
3.9 E(x+y)=E ; _ 
Xp T Yp E( Xp “alr Yp) 
E(x1) + E(y1) E(x1) E(y1) 
E(x2) + E(y2) E(x2) E(y2) 
E( xp) + E(yp) E( xp) E(yp) 
x] YI 
x2 y2 
=F +E 
Xp yp 


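3.10 $E[(\mathbf{y} - \boldsymbol{\mu})(\mathbf{y} - \boldsymbol{\mu})'] = E[\mathbf{yy}' - \mathbf{y}\boldsymbol{\mu}' - \boldsymbol{\mu}\mathbf{y}' + \boldsymbol{\mu}\boldsymbol{\mu}']$
$= E(\mathbf{yy}') - E(\mathbf{y})\boldsymbol{\mu}' - \boldsymbol{\mu}E(\mathbf{y}') + E(\boldsymbol{\mu}\boldsymbol{\mu}')$ [by (3.21)
and (3.36)]
$= E(\mathbf{yy}') - \boldsymbol{\mu}\boldsymbol{\mu}' - \boldsymbol{\mu}\boldsymbol{\mu}' + \boldsymbol{\mu}\boldsymbol{\mu}'$.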


3.11 Use the square root matrix >!” defined in (2.107) to write (3.27) as 


(y— WIS "(y— w= (y— WCEP? yy — 


= (C2? "Cy — ICs") "(y — wl = Zz, say. 


Show that cov(z) = I (see Problem 5.17). 




a1) cov) =cov(¥) = BC mae may by (3.24) 
>. « 


x My x By 
= 6(2 oe Vey ayy r= He) 
xX— Py, 


(y— By — BY Cy — By) BY! 


=E 
(x = BOC _ My)! (x _ By)(x = B,) 


El(y — my)Cy — By] Ely — By )(x — BY] 
El(x— By — BY) ELK = wx y)'] 


eae, [by (3.34)] 
= Yn zy) y (3. 


3.14 (i) If we write A in terms of its rows, then 


/ / 
a; ay 
a} aby 
Ay=|.], y= 
ai, aly 


Then, by Theorem 3.6a, E(a‘y) = a/E(y), and the result follows by 
(3.20). 

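(ii) Write $\mathbf{X}$ in terms of its columns $\mathbf{x}_i$ as $\mathbf{X} = (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_p)$. Since $\mathbf{Xb}$ is a
random vector, we have, by Theorem 3.6a,

$$\begin{aligned}
E(\mathbf{a}'\mathbf{Xb}) &= \mathbf{a}'E(\mathbf{Xb}) \\
&= \mathbf{a}'E(b_1\mathbf{x}_1 + b_2\mathbf{x}_2 + \cdots + b_p\mathbf{x}_p) && \text{[by (2.37)]} \\
&= \mathbf{a}'[b_1E(\mathbf{x}_1) + b_2E(\mathbf{x}_2) + \cdots + b_pE(\mathbf{x}_p)] \\
&= \mathbf{a}'[E(\mathbf{x}_1), E(\mathbf{x}_2), \ldots, E(\mathbf{x}_p)]\mathbf{b} && \text{[by (2.37)]} \\
&= \mathbf{a}'E(\mathbf{X})\mathbf{b}.
\end{aligned}$$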
ai 
7 a) 
(iii) E(AXB)=E|| — | X(bj, by, ..., bp) 




a, Xb, ajXb2 --- a'Xb, 
a,Xb, aXbo --- a5Xb, 

=E 
a, Xb; a,Xbo --- a.Xb, 
a E(X)b, al E(X)by ++. al E(X)b, 
aE(X)b, aE(X)b) --- aE(X)b, 
alE(X)b, al E(X)by ++» al E(X)b, 
a 
a 

=| — |£(X)(bi, bo, ..., bp) = AE(X)B. 
ay 


3.15 By (3.21), $E(\mathbf{Ay} + \mathbf{b}) = E(\mathbf{Ay}) + E(\mathbf{b}) = \mathbf{A}E(\mathbf{y}) + \mathbf{b}$. Show that $E(\mathbf{b}) = \mathbf{b}$
if $\mathbf{b}$ is a constant vector.


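3.16 By (3.10) and Theorem 3.6a, we obtain

$$\begin{aligned}
\mathrm{cov}(\mathbf{a}'\mathbf{y}, \mathbf{b}'\mathbf{y}) &= E[(\mathbf{a}'\mathbf{y} - \mathbf{a}'\boldsymbol{\mu})(\mathbf{b}'\mathbf{y} - \mathbf{b}'\boldsymbol{\mu})] \\
&= E[\mathbf{a}'(\mathbf{y} - \boldsymbol{\mu})(\mathbf{y} - \boldsymbol{\mu})'\mathbf{b}] && \text{[by (2.18)]} \\
&= \mathbf{a}'E[(\mathbf{y} - \boldsymbol{\mu})(\mathbf{y} - \boldsymbol{\mu})']\mathbf{b} && \text{[by Theorem 3.6b(ii)]} \\
&= \mathbf{a}'\boldsymbol{\Sigma}\mathbf{b} && \text{[by (3.24)]}.
\end{aligned}$$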


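3.17 (i) By Theorem 3.6b parts (i) and (iii), we obtain

$$\begin{aligned}
\mathrm{cov}(\mathbf{Ay}) &= E[(\mathbf{Ay} - \mathbf{A}\boldsymbol{\mu})(\mathbf{Ay} - \mathbf{A}\boldsymbol{\mu})'] \\
&= E[\mathbf{A}(\mathbf{y} - \boldsymbol{\mu})(\mathbf{y} - \boldsymbol{\mu})'\mathbf{A}'] \\
&= \mathbf{A}E[(\mathbf{y} - \boldsymbol{\mu})(\mathbf{y} - \boldsymbol{\mu})']\mathbf{A}' \\
&= \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}' && \text{[by (3.24)]}.
\end{aligned}$$

(ii) By (3.34) and Theorem 3.6b(i), $\mathrm{cov}(\mathbf{Ay}, \mathbf{By}) = E[(\mathbf{Ay} - \mathbf{A}\boldsymbol{\mu})(\mathbf{By} - \mathbf{B}\boldsymbol{\mu})']$. Show that this is equal to $\mathbf{A}\boldsymbol{\Sigma}\mathbf{B}'$.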
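
The rule $\mathrm{cov}(\mathbf{Ay}) = \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}'$ has an exact sample analogue: the sample covariance matrix of the transformed observations equals $\mathbf{ASA}'$, where $\mathbf{S}$ is the sample covariance matrix of the original observations. The sketch below checks this with exact fractions. The data, the matrix $\mathbf{A}$, and the helper names are hypothetical.

```python
from fractions import Fraction

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

def sample_cov(rows):
    """Sample covariance matrix (divisor n - 1) of observations stored as rows."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
             for j in range(p)] for i in range(p)]

def frac(m):
    return [[Fraction(v) for v in row] for row in m]

obs = frac([[1, 2], [2, 1], [4, 3], [3, 6]])  # four observations on y (2 x 1)
A_lin = frac([[1, 2], [0, 1], [1, -1]])       # 3 x 2 matrix of constants

S = sample_cov(obs)
transformed = matmul(obs, transpose(A_lin))   # row i holds (A y_i)'
cov_direct = sample_cov(transformed)
cov_formula = matmul(matmul(A_lin, S), transpose(A_lin))
```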


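3.18 By (3.24) and (3.41), we have

$$\begin{aligned}
\mathrm{cov}(\mathbf{Ay} + \mathbf{b}) &= E\{[\mathbf{Ay} + \mathbf{b} - (\mathbf{A}\boldsymbol{\mu} + \mathbf{b})][\mathbf{Ay} + \mathbf{b} - (\mathbf{A}\boldsymbol{\mu} + \mathbf{b})]'\} \\
&= E\{[\mathbf{Ay} - \mathbf{A}\boldsymbol{\mu}][\mathbf{Ay} - \mathbf{A}\boldsymbol{\mu}]'\}.
\end{aligned}$$

Show that this is equal to $\mathbf{A}\boldsymbol{\Sigma}\mathbf{A}'$.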


3.19 Let z= ,K= 


sae ue 
COoOF 
oReok--5e) 


Oo O 
Oo O 
C O 
Oo D 
Bx, C 


(O oO !I I). Then cov(Ay + x, Cv + Dw) 


= LK cov(z) K’M’ 


yy ae a, OTA O' 
=(A BO 0) ie li OV Sa O’ 
Ze So Dae ay C 
Zwy 2x Bw Bw 7 \D 


= AXwC + BEyC' + A2y,D! + BED". 


3.20 (a) E(z) =8,  var(z) =2. 


(b) x)= (_4). covin) = (_7) ee 


6 6 —14 18 

3.21 (a) E(w)=|{ —-10], cov(w)= | —14 67 —49 |. 
6 18 —49 57 

(b) cov(z, w) = e =e ak 


Chapter 4 


4.1 Use (3.2) and (3.8) and integrate directly. 
4.2 By (2.67) [B7)=|ce'?)} =|e7'7|"1. We now use (2.77) to 


obtain || = Dae aed = >a |p| = (ee form which it follows 
that [%/?| = |z|'”. 




4.3 Using Theorem 2.14a and the chain rule for differentiation (and assuming that 
we can interchange integration and differentiation), we obtain 


Oety = ty Ot'y _ 


= yetY 
Ot . ot ye 5) 
OM,(t) _ O tly 7 O ty 
at = 5 fe F(ydy = | \oe f (y)dy, 
OM, (0 
- = |: vs | vecyndy = E(y) [by (3.2) and (3.20)). 
ely O 1, Oy ra } 
4.4 = tty = ty ~ ty 
Ot, Ots Ot, (« Ot, ) Ot, (yse ) Yrys€ *. 


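4.5 Multiply out the third term on the right side in terms of $\mathbf{y} - \boldsymbol{\mu}$ and $\boldsymbol{\Sigma}\mathbf{t}$.
4.6 $M_{\mathbf{y}-\boldsymbol{\mu}}(\mathbf{t}) = E[e^{\mathbf{t}'(\mathbf{y}-\boldsymbol{\mu})}] = E(e^{\mathbf{t}'\mathbf{y} - \mathbf{t}'\boldsymbol{\mu}}) = e^{-\mathbf{t}'\boldsymbol{\mu}}E(e^{\mathbf{t}'\mathbf{y}}) = e^{-\mathbf{t}'\boldsymbol{\mu}}e^{\mathbf{t}'\boldsymbol{\mu} + \mathbf{t}'\boldsymbol{\Sigma}\mathbf{t}/2}$.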


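4.7 $E(e^{\mathbf{t}'\mathbf{Ay}}) = E(e^{(\mathbf{A}'\mathbf{t})'\mathbf{y}})$. Now use Theorem 4.3 with $\mathbf{A}'\mathbf{t}$ in place of $\mathbf{t}$ to
obtain

$$E(e^{(\mathbf{A}'\mathbf{t})'\mathbf{y}}) = e^{(\mathbf{A}'\mathbf{t})'\boldsymbol{\mu} + (1/2)(\mathbf{A}'\mathbf{t})'\boldsymbol{\Sigma}(\mathbf{A}'\mathbf{t})} = e^{\mathbf{t}'(\mathbf{A}\boldsymbol{\mu}) + (1/2)\mathbf{t}'(\mathbf{A}\boldsymbol{\Sigma}\mathbf{A}')\mathbf{t}}.$$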


4.8 Let $K(t) = \ln[M(t)]$. Then $K'(t) = \dfrac{M'(t)}{M(t)}$ and $K''(t) = \dfrac{M''(t)}{M(t)} - \left[\dfrac{M'(t)}{M(t)}\right]^2$. Since
$M(0) = 1$, $K''(0) = M''(0) - [M'(0)]^2 = \sigma^2$.


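4.9 $\mathbf{C}\boldsymbol{\Sigma}\mathbf{C}' = \mathbf{C}(\sigma^2\mathbf{I})\mathbf{C}' = \sigma^2\mathbf{CC}' = \sigma^2\mathbf{I}$. Use Theorem 4.4a(ii).
4.10 The moment generating function for $\mathbf{z} = \mathbf{Ay} + \mathbf{b}$ is

$$\begin{aligned}
M_{\mathbf{z}}(\mathbf{t}) = E(e^{\mathbf{t}'\mathbf{z}}) &= E(e^{\mathbf{t}'(\mathbf{Ay}+\mathbf{b})}) = E(e^{\mathbf{t}'\mathbf{Ay} + \mathbf{t}'\mathbf{b}}) = e^{\mathbf{t}'\mathbf{b}}E(e^{\mathbf{t}'\mathbf{Ay}}) \\
&= e^{\mathbf{t}'\mathbf{b}}e^{\mathbf{t}'(\mathbf{A}\boldsymbol{\mu}) + \mathbf{t}'(\mathbf{A}\boldsymbol{\Sigma}\mathbf{A}')\mathbf{t}/2} \qquad \text{[by (4.25)]} \\
&= e^{\mathbf{t}'(\mathbf{A}\boldsymbol{\mu}+\mathbf{b}) + \mathbf{t}'(\mathbf{A}\boldsymbol{\Sigma}\mathbf{A}')\mathbf{t}/2},
\end{aligned}$$

which is the moment generating function for a multivariate normal random
vector with mean vector $\mathbf{A}\boldsymbol{\mu} + \mathbf{b}$ and covariance matrix $\mathbf{A}\boldsymbol{\Sigma}\mathbf{A}'$.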




4.11 Use (2.35) and (2.36). 

4.12 By Theorem 3.6d(ii), $\mathrm{cov}(\mathbf{Ay}, \mathbf{By}) = \mathbf{A}\boldsymbol{\Sigma}\mathbf{B}'$.

4.13 Write g(y, x) in terms of & ) and } = (se = ) For |X| and 5, see 
x xy XX 


(2.72) and (2.50). After canceling A(x) in (4.28), show that f (y|x) can be 
written in the form 


(Y= Myx) Bye (Y¥—My.y)/2 


1 
fly) = (2mP|,,.|2° 


where py. = By + ZyrBy(X — My) and Bye = By — Bye By Bay 
4.14 cov(y — Bx, x) = cov { L —B) 0 ). (0,1) ( )| . Use Theorem 3.6d(ii) 


vi Ys 1 7 ars | 
4.16 (a) (3) ism] (5). (4 Ih 
(b) y> is M(2, 6). 
(c) z is N(—4, 79). 


ae ()emlah(. 8)] 


pg & ‘| 
e ‘ ; =N. 2 : : 
(e) f (yi, yaly3, ya) |( ys +Lyq > 4 


—1 
= fa So 6. (22 Ve 2 
ee | 2 GDNAT 3 
cov(yiyely2={( 1 5}—-l3 _a}l_o 4 oS Sa 
Thus 
L43y+2y4 g 2 
st5y2— 4) \E 
(g) p13 = —1/2VS. 


(h) py3.04 = 1/V6. Note that p;3.5, is opposite in sign to pj3. 
(i) Using the partitioning 


f (v1. y3|2, 4) = No 




we have 
6 8 SON fig? 
E( yily2, 3, ya) =14+(2 -1 2)] 3 5 -4 y3—3 
id A ya +2 
1/4-1/4 -1/8 
; y2 —2 
=14+(2 -1 2) 7 5/4 9/8 y3 —3 
+2 
1/8 9/8 21/16) \** 
yo  y3 Sy4 
ae ae 
soa 
1/4 -1/4 -1/8 2 
var(y1|¥2, ¥3, Y4) = 4 —(2 1 2)|-1/4 5/4 9/8 = 
1/8 9/8 21/16 2 
=4-3=1. 


Thus f (1 |Y2,.3,¥4) = N(1 + 1/2y2 + 1/2y3 + 5/4ya, 1). 
4.17 (a) N(17, 79). 
oni al 


(c) f (v2ly1, y3) = N(—3 + 41 + 43, 17/12). 


2 4 1 
(d) f(y po=Ml (tay, GG :)]: 


(©) py, = V2/4 = 3536, pin, = 3/20 = .3873. 


4.18 y, and y> are independent, y, and y3 are independent. 
4.19 y, and y» are independent, (y;, y2) and (y3, y4) are independent. 
4.20 Using the expression in (4.38) for &,, in terms of its rows o;,, show that 


ryrl ryol ryol 
Oa Fx O12 02x eae 12a D px 

/ 77 / = / Ex: 
O42, O1x O24 O2x aes 04.24 D px 


. 23, = 


a i 1 A / “1 
Oey Oix Ox Rex O2x ee O px O px 




Chapter 5 


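5.1 $\sum_{i=1}^n (y_i - \bar{y})^2 = \sum_{i=1}^n (y_i^2 - 2\bar{y}y_i + \bar{y}^2) = \sum_{i=1}^n y_i^2 - 2\bar{y}\sum_{i=1}^n y_i + n\bar{y}^2 = \sum_{i=1}^n y_i^2 - 2n\bar{y}^2 + n\bar{y}^2$.


5.2 By (2.23), $[(1/n)\mathbf{J}]^2 = (1/n^2)\mathbf{jj}'\mathbf{jj}' = (1/n^2)\mathbf{j}(n)\mathbf{j}' = (1/n)\mathbf{jj}' = (1/n)\mathbf{J}$.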


5.3. (a) By Theorem 5.2b we obtain 


a oh arly (1-73)y 
atl (el (bat) +aees(1-Laih 
7, t( =25 3) + 4u? o%(n—n)| eo 
> 


(b) var(s*) = var (= |. a var(u) 


_ 20° 


lee ou as 


5.4 Note that 6'V~! = p’~!. Because of symmetry of V and &, we have 
Vv '@= Sp. Substituting into the expression on the left we obtain 


—(1/2) 


>> eae | = 2axy st eo lw ew (I-2AR) 1S" w)/2 


= [E—2t AS [1/9 9 20S) NE a2, 


yl — Ly/y- — 1 
5.5 Expanding the second expression we obtain e[#> #-9V '@ty'V 'y-20'V ly4 
9Vv'4/2_ Substituting 6’ and V~', simplifying, and noting that @’V~! = 

fa >|, we obtain the first expression. 


11adjc| 1 dC is 
5.6 k(p= 2G | sc ht C~'~'. Using the chain rule, 
fh Afalei a a @\Cl dC ee i 
kK p= wc 1 Cc 1 1 
) 2\c/? | | 2|C| dt? eae dt % 
geen cs Oe eee 1 , ,dC 1 dC, 15-1 py 
5c pre CUy pet 5uC - C 7 > 




5.7 


5.8 


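$$\begin{aligned}
\mathbf{y}'\mathbf{Ay} &= (\mathbf{y} - \boldsymbol{\mu} + \boldsymbol{\mu})'\mathbf{A}(\mathbf{y} - \boldsymbol{\mu} + \boldsymbol{\mu}) \\
&= (\mathbf{y} - \boldsymbol{\mu})'\mathbf{A}(\mathbf{y} - \boldsymbol{\mu}) + (\mathbf{y} - \boldsymbol{\mu})'\mathbf{A}\boldsymbol{\mu} + \boldsymbol{\mu}'\mathbf{A}(\mathbf{y} - \boldsymbol{\mu}) + \boldsymbol{\mu}'\mathbf{A}\boldsymbol{\mu} \\
&= (\mathbf{y} - \boldsymbol{\mu})'\mathbf{A}(\mathbf{y} - \boldsymbol{\mu}) + 2(\mathbf{y} - \boldsymbol{\mu})'\mathbf{A}\boldsymbol{\mu} + \boldsymbol{\mu}'\mathbf{A}\boldsymbol{\mu}.
\end{aligned}$$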


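To show that $E[(\mathbf{y} - \boldsymbol{\mu})(\mathbf{y} - \boldsymbol{\mu})'\mathbf{A}(\mathbf{y} - \boldsymbol{\mu})] = \mathbf{0}$, we need to show that all
central third moments of the multivariate normal are zero. This can be
done by differentiating $M_{\mathbf{y}-\boldsymbol{\mu}}(\mathbf{t})$ from Corollary 1 to Theorem 4.3a. Show that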


fou My_,(t) 4 
oT BA, — Qf 1/2tSt ae a 
ar,0t,dt,  ~ ur » 105 | + Orr » tyOuj 


(5) (Ee) (Gee) mE) 


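Since there is a $t_i$ in every term, $\partial^3 M_{\mathbf{y}-\boldsymbol{\mu}}(\mathbf{t})/\partial t_r\,\partial t_s\,\partial t_u = 0$ for $\mathbf{t} = \mathbf{0}$ and
$E[(y_r - \mu_r)(y_s - \mu_s)(y_u - \mu_u)] = 0$ for all $r, s, u$.
For the second term, we have [by (3.40)]

$$2E[(\mathbf{y} - \boldsymbol{\mu})(\mathbf{y} - \boldsymbol{\mu})'\mathbf{A}\boldsymbol{\mu}] = 2\{E[(\mathbf{y} - \boldsymbol{\mu})(\mathbf{y} - \boldsymbol{\mu})']\}\mathbf{A}\boldsymbol{\mu} = 2\boldsymbol{\Sigma}\mathbf{A}\boldsymbol{\mu}.$$

For the third term, we have

$$E[(\mathbf{y} - \boldsymbol{\mu})\,\mathrm{tr}(\mathbf{A}\boldsymbol{\Sigma})] = [E(\mathbf{y} - \boldsymbol{\mu})][\mathrm{tr}(\mathbf{A}\boldsymbol{\Sigma})] = \mathbf{0}[\mathrm{tr}(\mathbf{A}\boldsymbol{\Sigma})] = \mathbf{0}.$$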


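5.9 By definition

$$\begin{aligned}
\mathrm{cov}(\mathbf{By}, \mathbf{y}'\mathbf{Ay}) &= E\{[\mathbf{By} - E(\mathbf{By})][\mathbf{y}'\mathbf{Ay} - E(\mathbf{y}'\mathbf{Ay})]\} \\
&= E\{[\mathbf{B}(\mathbf{y} - \boldsymbol{\mu})][\mathbf{y}'\mathbf{Ay} - E(\mathbf{y}'\mathbf{Ay})]\} \\
&= \mathbf{B}E\{(\mathbf{y} - \boldsymbol{\mu})[\mathbf{y}'\mathbf{Ay} - E(\mathbf{y}'\mathbf{Ay})]\} \\
&= \mathbf{B}\,\mathrm{cov}(\mathbf{y}, \mathbf{y}'\mathbf{Ay}) = 2\mathbf{B}\boldsymbol{\Sigma}\mathbf{A}\boldsymbol{\mu}.
\end{aligned}$$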


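5.10 In (3.34), we have $\boldsymbol{\Sigma}_{yx} = E[(\mathbf{y} - \boldsymbol{\mu}_y)(\mathbf{x} - \boldsymbol{\mu}_x)']$. Show that $E(\mathbf{yx}') = \boldsymbol{\Sigma}_{yx} + \boldsymbol{\mu}_y\boldsymbol{\mu}_x'$. Then

$$\begin{aligned}
E(\mathbf{x}'\mathbf{Ay}) &= E[\mathrm{tr}(\mathbf{x}'\mathbf{Ay})] = E[\mathrm{tr}(\mathbf{Ayx}')] \\
&= \mathrm{tr}[E(\mathbf{Ayx}')] = \mathrm{tr}[\mathbf{A}E(\mathbf{yx}')] \\
&= \mathrm{tr}[\mathbf{A}(\boldsymbol{\Sigma}_{yx} + \boldsymbol{\mu}_y\boldsymbol{\mu}_x')] = \mathrm{tr}(\mathbf{A}\boldsymbol{\Sigma}_{yx} + \mathbf{A}\boldsymbol{\mu}_y\boldsymbol{\mu}_x') \\
&= \mathrm{tr}(\mathbf{A}\boldsymbol{\Sigma}_{yx}) + \mathrm{tr}(\mathbf{A}\boldsymbol{\mu}_y\boldsymbol{\mu}_x') = \mathrm{tr}(\mathbf{A}\boldsymbol{\Sigma}_{yx}) + \mathrm{tr}(\boldsymbol{\mu}_x'\mathbf{A}\boldsymbol{\mu}_y) \\
&= \mathrm{tr}(\mathbf{A}\boldsymbol{\Sigma}_{yx}) + \boldsymbol{\mu}_x'\mathbf{A}\boldsymbol{\mu}_y.
\end{aligned}$$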




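5.11 (a) $\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) = \sum_i (x_iy_i - \bar{x}y_i - x_i\bar{y} + \bar{x}\bar{y}) = \sum_i x_iy_i - \bar{x}\sum_i y_i - \bar{y}\sum_i x_i + n\bar{x}\bar{y} = \sum_i x_iy_i - n\bar{x}\bar{y} - n\bar{x}\bar{y} + n\bar{x}\bar{y}$.
(b) With $\mathbf{x} = (x_1, x_2, \ldots, x_n)'$, $\mathbf{y} = (y_1, y_2, \ldots, y_n)'$, $\bar{x} = (1/n)\mathbf{j}'\mathbf{x}$, and
$\bar{y} = (1/n)\mathbf{j}'\mathbf{y}$, we have

$$n\bar{x}\bar{y} = n\left(\frac{1}{n}\right)^2\mathbf{x}'\mathbf{jj}'\mathbf{y} = \frac{1}{n}\mathbf{x}'\mathbf{jj}'\mathbf{y} = \mathbf{x}'\left(\frac{1}{n}\mathbf{J}\right)\mathbf{y},$$

$$\sum_{i=1}^n x_iy_i - n\bar{x}\bar{y} = \mathbf{x}'\mathbf{y} - \mathbf{x}'\left(\frac{1}{n}\mathbf{J}\right)\mathbf{y} = \mathbf{x}'\left(\mathbf{I} - \frac{1}{n}\mathbf{J}\right)\mathbf{y}.$$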
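
The identities in Problem 5.11 can be checked on a small data set. The sketch below computes the corrected sum of products three ways: directly, by the computational shortcut, and as a bilinear form in the centering matrix $\mathbf{I} - (1/n)\mathbf{J}$. The data values are hypothetical.

```python
from fractions import Fraction

x_vals = [Fraction(v) for v in [2, 4, 4, 7]]
y_vals = [Fraction(v) for v in [1, 3, 2, 6]]
n_obs = len(x_vals)

x_bar = sum(x_vals) / n_obs
y_bar = sum(y_vals) / n_obs

# Direct definition of the corrected sum of products.
direct = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x_vals, y_vals))

# Computational shortcut from part (a).
shortcut = sum(xi * yi for xi, yi in zip(x_vals, y_vals)) - n_obs * x_bar * y_bar

# Bilinear form x'(I - (1/n)J)y from part (b).
centering = [[(1 if i == j else 0) - Fraction(1, n_obs) for j in range(n_obs)]
             for i in range(n_obs)]
bilinear = sum(x_vals[i] * centering[i][j] * y_vals[j]
               for i in range(n_obs) for j in range(n_obs))
```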
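5.12 Apply (5.5), (5.9), and (5.8) with $\boldsymbol{\mu} = \mathbf{0}$, $\mathbf{A} = \mathbf{I}$, and $\boldsymbol{\Sigma} = \mathbf{I}$. The results follow.


5.13 By (5.9), $\mathrm{var}(\mathbf{y}'\mathbf{Ay}) = 2\,\mathrm{tr}[(\mathbf{A}\boldsymbol{\Sigma})^2] + 4\boldsymbol{\mu}'\mathbf{A}\boldsymbol{\Sigma}\mathbf{A}\boldsymbol{\mu}$. In this case, we seek $\mathrm{var}(\mathbf{y}'\mathbf{y})$,
where $\mathbf{y}$ is $N_n(\boldsymbol{\mu}, \mathbf{I})$. Hence, $\mathbf{A} = \boldsymbol{\Sigma} = \mathbf{I}$, and

$$\mathrm{var}(\mathbf{y}'\mathbf{y}) = 2\,\mathrm{tr}(\mathbf{I}^2) + 4\boldsymbol{\mu}'\boldsymbol{\mu} = 2n + 8\lambda.$$

Since $\mathbf{I}$ is $n \times n$, $\mathrm{tr}(\mathbf{I}) = n$, and by (5.24), $\boldsymbol{\mu}'\boldsymbol{\mu} = 2\lambda$.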


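5.14 $$\begin{aligned}
\ln M_v(t) &= -(n/2)\ln(1 - 2t) - \lambda[1 - (1 - 2t)^{-1}], \\
\frac{d\ln M_v(t)}{dt} &= \frac{n}{1 - 2t} + 2\lambda(1 - 2t)^{-2}, \\
\frac{d\ln M_v(0)}{dt} &= n + 2\lambda, \\
\frac{d^2\ln M_v(t)}{dt^2} &= \frac{2n}{(1 - 2t)^2} + 8\lambda(1 - 2t)^{-3}, \\
\frac{d^2\ln M_v(0)}{dt^2} &= 2n + 8\lambda.
\end{aligned}$$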
5.15 Since vj, v2,..., ¥, are independent, we have 


2) 


Ms,,,(1) = Ele) = Bee ve 
= E(e™ )E(e™”)... E(e) 


k 
_ ‘ eo MlI-1/C1-20)] 
II. vif l= yr 


= ! el 1/1-20)3as 
(1 — 2nee/? 


Thus by (5.25), Sj; is x°( Sinj, DiAi). 




5.16 


5.19 


5.21 
5.22 
5.23 
5.24 




(a) $t^2 = z^2/(u/p)$ is $F(1, p)$ since $z^2$ is $\chi^2(1)$, $u$ is $\chi^2(p)$, and $z^2$ and $u$ are
independent.

(b) $t^2 = y^2/(u/p)$ is $F(1, p, \frac{1}{2}\mu^2)$ since $y^2$ is $\chi^2(1, \frac{1}{2}\mu^2)$, $u$ is $\chi^2(p)$, and $y^2$
and $u$ are independent.


EIS"? (y = w)] = 27'7[E(y) — pl] = 0. cov[E/7(y — pw] = E"cov(y— 

ps? ay! Pyy PP? ay Py tPylP yl? 1. Then by Theorem 

4.4a(ii), S/?(y — p) is N,(0, D. 

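(a) In this case, $\boldsymbol{\Sigma} = \sigma^2\mathbf{I}$ and $\mathbf{A}$ is replaced by $\mathbf{A}/\sigma^2$. We thus have $(\mathbf{A}/\sigma^2)(\sigma^2\mathbf{I}) = \mathbf{A}$, which is idempotent.

(b) By Theorem 5.5, $\mathbf{y}'(\mathbf{A}/\sigma^2)\mathbf{y}$ is $\chi^2(r, \lambda)$ if $(\mathbf{A}/\sigma^2)\boldsymbol{\Sigma}$ is idempotent. In this
case, $\boldsymbol{\Sigma} = \sigma^2\mathbf{I}$, so $(\mathbf{A}/\sigma^2)\boldsymbol{\Sigma} = (\mathbf{A}/\sigma^2)(\sigma^2\mathbf{I}) = \mathbf{A}$. For $\lambda$, we have
$\lambda = \frac{1}{2}\boldsymbol{\mu}'(\mathbf{A}/\sigma^2)\boldsymbol{\mu} = \boldsymbol{\mu}'\mathbf{A}\boldsymbol{\mu}/2\sigma^2$.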


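By Theorem 5.5, $(\mathbf{y} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{y} - \boldsymbol{\mu})$ is $\chi^2(n)$ because $\mathbf{A}\boldsymbol{\Sigma} = \boldsymbol{\Sigma}^{-1}\boldsymbol{\Sigma} = \mathbf{I}$
(which is idempotent) and $E(\mathbf{y} - \boldsymbol{\mu}) = \mathbf{0}$. The distribution of $\mathbf{y}'\boldsymbol{\Sigma}^{-1}\mathbf{y}$ is
$\chi^2(n, \lambda)$, where $\lambda = \frac{1}{2}\boldsymbol{\mu}'\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}$.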

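All of these are direct applications of Theorem 5.5.

(a) $\lambda = \frac{1}{2}\boldsymbol{\mu}'\mathbf{A}\boldsymbol{\mu} = \frac{1}{2}\mathbf{0}'\mathbf{A}\mathbf{0} = 0$.

(b) $\mathbf{A}\boldsymbol{\Sigma} = (\mathbf{I}/\sigma^2)(\sigma^2\mathbf{I}) = \mathbf{I}$, which is idempotent. $\lambda = \frac{1}{2}\boldsymbol{\mu}'(\mathbf{I}/\sigma^2)\boldsymbol{\mu} = \boldsymbol{\mu}'\boldsymbol{\mu}/2\sigma^2$.

(c) In this case "$\mathbf{A}\boldsymbol{\Sigma}$" becomes $(\mathbf{A}/\sigma^2)(\sigma^2\boldsymbol{\Sigma}) = \mathbf{A}\boldsymbol{\Sigma}$.

$\mathbf{B}\boldsymbol{\Sigma}\mathbf{A} = \mathbf{B}(\sigma^2\mathbf{I})\mathbf{A} = \sigma^2\mathbf{BA}$, which is $\mathbf{O}$ if $\mathbf{BA} = \mathbf{O}$.

$\mathbf{j}'[\mathbf{I} - (1/n)\mathbf{J}] = \mathbf{j}' - (1/n)\mathbf{j}'\mathbf{jj}' = \mathbf{j}' - (1/n)(n)\mathbf{j}' = \mathbf{0}'$.

$\mathbf{A}\boldsymbol{\Sigma}\mathbf{B} = \mathbf{A}(\sigma^2\mathbf{I})\mathbf{B} = \sigma^2\mathbf{AB}$, which is $\mathbf{O}$ if $\mathbf{AB} = \mathbf{O}$.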

(a) Use Theorem 4.4a(i). In this case a = j/n. 


ae z _ = w/(o/yn) 
Jul(n—1) J/[(n— Is?/0?2/(n — 1 
z= (¥— p)/(a/V/n) is NO, 1). 
(c) Let v = (¥— My)/(0//n). Then E(v) = (ww — Mo)/(a//n) = 5, say, 
and var(v) = [1/(07/n)] var(y) = 1. Hence v is N(6, 1), and by (5.29), 
we obtain 


. Show that 


v _Y— Mo 


je aT: is t(n—1, 6). 
n—1 


Thus 8 = (tu — pp)/(o/ Vn). 




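5.25 By Problem 5.2, $(1/n)\mathbf{J}$ and $\mathbf{I} - (1/n)\mathbf{J}$ are idempotent and $[\mathbf{I} - (1/n)\mathbf{J}][(1/n)\mathbf{J}] = \mathbf{O}$. By Example 5.5, $\sum_{i=1}^n (y_i - \bar{y})^2/\sigma^2 = \mathbf{y}'[\mathbf{I} - (1/n)\mathbf{J}]\mathbf{y}/\sigma^2$
is $\chi^2(n - 1)$. Show that $n\bar{y}^2/\sigma^2 = \mathbf{y}'[(1/n)\mathbf{J}]\mathbf{y}/\sigma^2$ is $\chi^2(1, \lambda)$, where
$\lambda = \frac{1}{2}\boldsymbol{\mu}'\mathbf{A}\boldsymbol{\mu} = n\mu^2/2\sigma^2$. Since $[\mathbf{I} - (1/n)\mathbf{J}][(1/n)\mathbf{J}] = \mathbf{O}$, the quadratic
forms $\mathbf{y}'[(1/n)\mathbf{J}]\mathbf{y}$ and $\mathbf{y}'[\mathbf{I} - (1/n)\mathbf{J}]\mathbf{y}$ are independent. Thus by (5.26),
$n\bar{y}^2/[\sum_{i=1}^n (y_i - \bar{y})^2/(n - 1)]$ is $F(1, n - 1, \lambda)$, where $\lambda = n\mu^2/2\sigma^2$. If
$\mu = 0$ ($H_0$ is true), then $\lambda = 0$ and $n\bar{y}^2/[\sum_{i=1}^n (y_i - \bar{y})^2/(n - 1)]$ is
$F(1, n - 1)$.


5.26


5.27


5.28
5.29
5.30


5.31


5.32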


(b) Since 


yi Oi — _y(l-(C1/nJly | ! A 
(1 — p) a1 — p) “4 


we have 


A ee 
o(1—p) o(1—p) 


[I — (1/n)J][C1 — pt + pS. 


Show that this equals (I — ly ), which is idempotent. 


(a) E(y’ Ay) = tr(A®) + p’Ap = —16. 

(b) var(y/Ay) = 2tr(AX)’ + 4p/ADAp = 21,138. 
(c) Check to see if AX is indepotent. 

(d) Check to see if A is idempotent. 


A= ' =diag(!, 1, 1), 1p/!Ap = 2.9167. 

A= 1, Lp'Ap = 27. 

(a) Show that A is idempotent of rank 2, which is equal to tr (A). Therefore, 
y Ay/o is x7(2, w’Ap/207), where + p/Ap = $( 12.6) = 6.3. 


(b) BA= CG : ny # O. Hence y’Ay and By are not independent. 


(c) yj +y2 +3 = jy. Show that j/A = 0’. Hence y’Ay and y; + y2 + y3 are 


independent. 


(a) Show that B is idempotent of rank 1. Therefore, y’/By/o is 
C1, w’By/207). Find 5 p'By. 
(b) Show that BA = O. Therefore, y'By and y’Ay are independent. 


(a) A? = X(X'/X)~!X’X(X'X)~ |X’ = X(X’X)"!X’ = A. By Theorem 2.13d 
rank( A) = tr (A) = tr [X(X/X)~!X’]. By Theorem 2.11(ii), _ this 


544 ANSWERS AND HINTS TO THE PROBLEMS 


becomes tr [X(X’X)~!X’] = tr( I,) =p. Similarly, rank tr (I— A) = 
n—p. 

(b) tr[A(o2D)]=tr[o2X(X'X)-)X"]=po?. p!Ap=(Xb)'X(X'X)-EX/(Xb) = 
b'X’Xb. Thus Ely’Ay]=po2+b'X’Xb. Show that tr[(I — A)(o2D] = 
o°(n—p) and p'(I— A)w=0, if w=Xb. Hence Ely’(I— A)y] = (n— 
po. 

(c) yAy/o? is) ¥(p,A), where A= p'Ap/20? = b/X'Xb/20°. 
y(I— A)y/o? is x'(n—p). 

(d) Show that X(X'X)~!X’[I — X(X’X)-!X’] = O. Then, by Corollary 1 to 
Theorem 5.6b, y’Ay and y'(I — A)y are independent. 

(e) F( p,n—p, A), where A = b'X'Xb/20°. 


Chapter 6 


6.1 Equations (6.3) and (6.4) can be written as

\[ \sum_i y_i - n\hat\beta_0 - \hat\beta_1\sum_i x_i = 0, \]
\[ \sum_i x_i y_i - \hat\beta_0\sum_i x_i - \hat\beta_1\sum_i x_i^2 = 0. \]

Solving for $\hat\beta_0$ from the first equation gives $\hat\beta_0 = \sum_i y_i/n - \hat\beta_1\sum_i x_i/n = \bar y - \hat\beta_1\bar x$. Substituting this into the second equation gives the result for $\hat\beta_1$.

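6.2 (a) Show that $\sum_i (x_i - \bar x)(y_i - \bar y) = \sum_i (x_i - \bar x)y_i$. Then $\hat\beta_1 = \sum_i (x_i - \bar x)y_i/c$, where $c = \sum_i (x_i - \bar x)^2$. Now, using $E(y_i) = \beta_0 + \beta_1 x_i$ from assumption 1 in Section 6.1, we obtain (assuming that the x's are constants)

\[ E(\hat\beta_1) = E\Big[\sum_i (x_i - \bar x)y_i/c\Big] = \sum_i (x_i - \bar x)E(y_i)/c \]
\[ = \sum_i (x_i - \bar x)(\beta_0 + \beta_1 x_i)/c \]
\[ = \beta_0\sum_i (x_i - \bar x)/c + \beta_1\sum_i (x_i - \bar x)x_i/c \]
\[ = 0 + \beta_1\sum_i (x_i - \bar x)(x_i - \bar x)/c = \beta_1\sum_i (x_i - \bar x)^2\Big/\sum_i (x_i - \bar x)^2 = \beta_1. \]

(b)
\[ E(\hat\beta_0) = E(\bar y - \hat\beta_1\bar x) = E\Big(\sum_i y_i/n\Big) - \bar x E(\hat\beta_1) \]
\[ = \sum_i E(y_i)/n - \beta_1\bar x = \sum_i (\beta_0 + \beta_1 x_i)/n - \beta_1\bar x \]
\[ = \sum_i \beta_0/n + \beta_1\sum_i x_i/n - \beta_1\bar x = n\beta_0/n + \beta_1\bar x - \beta_1\bar x = \beta_0. \]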


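6.3 (a) Using $\hat\beta_1 = \sum_i (x_i - \bar x)y_i/c$, as in the answer to Problem 6.2, and assuming $\mathrm{var}(y_i) = \sigma^2$ and $\mathrm{cov}(y_i, y_j) = 0$, we have

\[ \mathrm{var}(\hat\beta_1) = \frac{1}{c^2}\sum_i (x_i - \bar x)^2\,\mathrm{var}(y_i) = \frac{1}{c^2}\sum_i (x_i - \bar x)^2\sigma^2 = \frac{\sigma^2\sum_i (x_i - \bar x)^2}{[\sum_i (x_i - \bar x)^2]^2} = \frac{\sigma^2}{\sum_i (x_i - \bar x)^2}. \]

(b) Show that $\hat\beta_0$ can be written in the form $\hat\beta_0 = \sum_i y_i/n - \bar x\sum_i (x_i - \bar x)y_i/c$. Then

\[ \mathrm{var}(\hat\beta_0) = \mathrm{var}\Big\{\sum_i \Big[\frac1n - \frac{\bar x(x_i - \bar x)}{c}\Big]y_i\Big\} = \sum_i \Big[\frac1n - \frac{\bar x(x_i - \bar x)}{c}\Big]^2\sigma^2 \]
\[ = \sigma^2\sum_i \Big[\frac{1}{n^2} - \frac{2\bar x(x_i - \bar x)}{nc} + \frac{\bar x^2(x_i - \bar x)^2}{c^2}\Big] \]
\[ = \sigma^2\Big[\frac{n}{n^2} - \frac{2\bar x}{nc}\sum_i (x_i - \bar x) + \frac{\bar x^2\sum_i (x_i - \bar x)^2}{c^2}\Big] \]
\[ = \sigma^2\Big[\frac1n + \frac{\bar x^2}{\sum_i (x_i - \bar x)^2}\Big]. \]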




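6.4 Suppose that $k$ of the $x_i$'s are equal to $a$ and the remaining $n - k$ $x_i$'s are equal to $b$. Then

\[ \bar x = \frac{ka + (n-k)b}{n}, \]
\[ \sum_i (x_i - \bar x)^2 = k(a - \bar x)^2 + (n-k)(b - \bar x)^2 \]
\[ = k\Big[a - \frac{ka + (n-k)b}{n}\Big]^2 + (n-k)\Big[b - \frac{ka + (n-k)b}{n}\Big]^2 \]
\[ = \frac{k}{n^2}[(n-k)(a-b)]^2 + \frac{n-k}{n^2}[-k(a-b)]^2 \]
\[ = \frac{(a-b)^2}{n^2}[k(n-k)^2 + k^2(n-k)] \]
\[ = \frac{(a-b)^2}{n^2}\,k(n-k)(n-k+k) = \frac{(a-b)^2 k(n-k)}{n}. \]

We then differentiate with respect to $k$ and set the result equal to 0 to solve for $k$:

\[ \frac{\partial}{\partial k}\,\frac{(a-b)^2 k(n-k)}{n} = \frac{(a-b)^2}{n}[k(-1) + n - k] = 0, \qquad k = \frac{n}{2}. \]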
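
The closed form above can be checked numerically. The following is a minimal sketch, assuming a hypothetical design with n = 10 points placed at a = 2 and b = 5 (these values are illustrative and do not come from the text). It computes the sum of squares directly for every split k and compares it with (a - b)^2 k(n - k)/n.

```python
def sxx_two_point(n, k, a, b):
    """Directly compute sum_i (x_i - xbar)^2 when k of the x's equal a and n - k equal b."""
    xs = [a] * k + [b] * (n - k)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs)


def closed_form(n, k, a, b):
    """Closed form from the derivation: (a - b)^2 * k * (n - k) / n."""
    return (a - b) ** 2 * k * (n - k) / n


# Hypothetical design values, chosen only for illustration.
n, a, b = 10, 2.0, 5.0

# Sum of squares for every possible split k = 0, 1, ..., n.
values = {k: sxx_two_point(n, k, a, b) for k in range(n + 1)}

# The split that maximizes the sum of squares (and so minimizes var(beta1_hat)).
best_k = max(values, key=values.get)
```

For even n the maximizing split is k = n/2, which agrees with the calculus argument; for odd n the two integers nearest n/2 tie.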


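6.5
\[ \mathrm{SSE} = \sum_i (y_i - \hat y_i)^2 = \sum_i (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2 \]
\[ = \sum_i (y_i - \bar y + \hat\beta_1\bar x - \hat\beta_1 x_i)^2 = \sum_i [y_i - \bar y - \hat\beta_1(x_i - \bar x)]^2 \]
\[ = \sum_i (y_i - \bar y)^2 - 2\hat\beta_1\sum_i (x_i - \bar x)(y_i - \bar y) + \hat\beta_1^2\sum_i (x_i - \bar x)^2. \]

Substitute $\hat\beta_1$ from (6.5) to obtain the result.

6.6 Show that $\mathrm{SSE} = \sum_i (y_i - \bar y)^2 - \hat\beta_1^2\sum_i (x_i - \bar x)^2$. Show that $\bar y = \beta_0 + \beta_1\bar x + \bar\varepsilon$, where $\bar\varepsilon = \sum_{i=1}^n \varepsilon_i/n$. Show that $E[\sum_i (y_i - \bar y)^2] = E\{\sum_i [\beta_1(x_i - \bar x) + \varepsilon_i - \bar\varepsilon]^2\} = \beta_1^2\sum_i (x_i - \bar x)^2 + (n-1)\sigma^2 + 0$. By (3.8), $E(\hat\beta_1^2) = \mathrm{var}(\hat\beta_1) + [E(\hat\beta_1)]^2 = \sigma^2/\sum_i (x_i - \bar x)^2 + \beta_1^2$.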
6.8 To test $H_0: \beta_1 = c$ versus $H_1: \beta_1 \ne c$, we use the test statistic

\[ t = \frac{\hat\beta_1 - c}{s\big/\sqrt{\sum_i (x_i - \bar x)^2}} \]

and reject $H_0$ if $|t| > t_{\alpha/2, n-2}$. Show that $t$ is distributed as $t(n-2, \delta)$, where

\[ \delta = \frac{\beta_1 - c}{\sigma\big/\sqrt{\sum_i (x_i - \bar x)^2}}. \]


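6.9 (a) To test $H_0: \beta_0 = a$ versus $H_1: \beta_0 \ne a$, we use the test statistic

\[ t = \frac{\hat\beta_0 - a}{s\sqrt{\dfrac1n + \dfrac{\bar x^2}{\sum_i (x_i - \bar x)^2}}} \]

and reject $H_0$ if $|t| > t_{\alpha/2, n-2}$. Show that $t$ is distributed as $t(n-2, \delta)$, where

\[ \delta = \frac{\beta_0 - a}{\sigma\sqrt{\dfrac1n + \dfrac{\bar x^2}{\sum_i (x_i - \bar x)^2}}}. \]

(b) A $100(1-\alpha)\%$ confidence interval for $\beta_0$ is given by

\[ \hat\beta_0 \pm t_{\alpha/2, n-2}\, s\sqrt{\frac1n + \frac{\bar x^2}{\sum_i (x_i - \bar x)^2}}. \]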


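6.10 We add and subtract $\hat y_i$ to obtain $\sum_i (y_i - \bar y)^2 = \sum_i (y_i - \hat y_i + \hat y_i - \bar y)^2$. Squaring the right side gives

\[ \sum_i (y_i - \bar y)^2 = \sum_i (y_i - \hat y_i)^2 + \sum_i (\hat y_i - \bar y)^2 + 2\sum_i (y_i - \hat y_i)(\hat y_i - \bar y). \]

In the third term on the right side, substitute $\hat y_i = \hat\beta_0 + \hat\beta_1 x_i$ and then $\hat\beta_0 = \bar y - \hat\beta_1\bar x$ to obtain

\[ \sum_i (y_i - \hat y_i)(\hat y_i - \bar y) = \sum_i (y_i - \hat\beta_0 - \hat\beta_1 x_i)(\hat\beta_0 + \hat\beta_1 x_i - \bar y) \]
\[ = \sum_i (y_i - \bar y + \hat\beta_1\bar x - \hat\beta_1 x_i)(\bar y - \hat\beta_1\bar x + \hat\beta_1 x_i - \bar y) \]
\[ = \sum_i [y_i - \bar y - \hat\beta_1(x_i - \bar x)][\hat\beta_1(x_i - \bar x)] \]
\[ = \hat\beta_1\sum_i (y_i - \bar y)(x_i - \bar x) - \hat\beta_1^2\sum_i (x_i - \bar x)^2. \]

This is equal to 0 by (6.5).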
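
The decomposition in Problem 6.10 can be illustrated numerically. The sketch below assumes a small made-up data set (the x and y values are hypothetical, not from the text), fits the least-squares line with the usual formulas for the slope and intercept, and confirms that the cross-product term vanishes so that SST = SSE + SSR.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for simple linear regression."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx            # slope
    b0 = ybar - b1 * xbar     # intercept
    return b0, b1


# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_line(x, y)
ybar = sum(y) / len(y)
yhat = [b0 + b1 * xi for xi in x]

sst = sum((yi - ybar) ** 2 for yi in y)                  # total sum of squares
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))     # error sum of squares
ssr = sum((yh - ybar) ** 2 for yh in yhat)               # regression sum of squares
cross = sum((yi - yh) * (yh - ybar) for yi, yh in zip(y, yhat))  # should be 0
```

Up to floating-point rounding, `cross` is zero and `sst` equals `sse + ssr`, as the algebra shows.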
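6.11
\[ \sum_i (\hat y_i - \bar y)^2 = \sum_i (\hat\beta_0 + \hat\beta_1 x_i - \bar y)^2 = \sum_i (\bar y - \hat\beta_1\bar x + \hat\beta_1 x_i - \bar y)^2 = \hat\beta_1^2\sum_i (x_i - \bar x)^2. \]

Substituting this into (6.16) and using (6.5) gives the desired result.

6.12 Since $x - \bar x j = (x_1 - \bar x, x_2 - \bar x, \ldots, x_n - \bar x)'$ and $y - \bar y j = (y_1 - \bar y, y_2 - \bar y, \ldots, y_n - \bar y)'$, (6.18) can be written as

\[ r = \frac{(x - \bar x j)'(y - \bar y j)}{\sqrt{[(x - \bar x j)'(x - \bar x j)][(y - \bar y j)'(y - \bar y j)]}}. \]

By (2.81), this is the cosine of $\theta$, the angle between the vectors $x - \bar x j$ and $y - \bar y j$.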
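6.13
\[ t = \frac{\hat\beta_1}{s\big/\sqrt{\sum_i (x_i - \bar x)^2}} = \frac{\hat\beta_1\sqrt{\sum_i (x_i - \bar x)^2}}{\sqrt{\mathrm{SSE}/(n-2)}} \]
\[ = \frac{\hat\beta_1\sqrt{\sum_i (x_i - \bar x)^2}\,\sqrt{n-2}}{\sqrt{\sum_i (y_i - \bar y)^2 - \hat\beta_1^2\sum_i (x_i - \bar x)^2}} \]
\[ = \frac{\hat\beta_1\sqrt{\dfrac{\sum_i (x_i - \bar x)^2}{\sum_i (y_i - \bar y)^2}}\,\sqrt{n-2}}{\sqrt{1 - \hat\beta_1^2\dfrac{\sum_i (x_i - \bar x)^2}{\sum_i (y_i - \bar y)^2}}} \]
\[ = \frac{\sqrt{n-2}\,r}{\sqrt{1 - r^2}}, \]
since $r = \hat\beta_1\sqrt{\sum_i (x_i - \bar x)^2\big/\sum_i (y_i - \bar y)^2}$.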


6.14 (a) $\hat\beta_0 = 31.752$, $\hat\beta_1 = 11.368$.
(b) t= 11.109, p=4.108 x 107». 
(c) 11.368 + 2.054, (9.313, 13.422). 


(d) $r^2 = \dfrac{\mathrm{SSR}}{\mathrm{SST}} = \dfrac{6833.7663}{9658.0755} = .7076$.
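
The arithmetic in parts (c) and (d) can be re-derived from the reported numbers. The following is a small sketch using only values printed in this answer. The last computation inverts the identity t = sqrt(n - 2) r / sqrt(1 - r^2) to back out the error degrees of freedom; that implied value is an inference from the reported t and r^2, not something stated in the text.

```python
# Values reported in the answer to Problem 6.14.
ssr = 6833.7663      # regression sum of squares
sst = 9658.0755      # total sum of squares
b1 = 11.368          # estimated slope
half_width = 2.054   # half-width of the confidence interval for the slope
t_stat = 11.109      # reported t statistic for the slope

# Part (d): coefficient of determination.
r2 = ssr / sst

# Part (c): confidence limits for the slope.
ci = (b1 - half_width, b1 + half_width)

# Consistency check (an inference, not given in the text):
# t^2 = (n - 2) r^2 / (1 - r^2), so n - 2 = t^2 (1 - r^2) / r^2.
implied_df = t_stat ** 2 * (1 - r2) / r2
```

The implied degrees of freedom come out very close to a whole number, which suggests the reported t and r^2 are mutually consistent.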




Chapter 7 


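7.1 By (2.18), $\hat\beta_0 + \hat\beta_1 x_{i1} + \cdots + \hat\beta_k x_{ik} = x_i'\hat\beta$. By (2.20) and (2.27), we obtain

\[ \sum_{i=1}^n (y_i - x_i'\hat\beta)^2 = (y_1 - x_1'\hat\beta,\; y_2 - x_2'\hat\beta,\; \ldots,\; y_n - x_n'\hat\beta)\begin{pmatrix} y_1 - x_1'\hat\beta \\ y_2 - x_2'\hat\beta \\ \vdots \\ y_n - x_n'\hat\beta \end{pmatrix} = (y - X\hat\beta)'(y - X\hat\beta). \]

This can also be seen directly by using estimates in the model $y = X\beta + \varepsilon$. Thus $\hat\varepsilon = y - X\hat\beta$, and $\hat\varepsilon'\hat\varepsilon = (y - X\hat\beta)'(y - X\hat\beta)$.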


7.2 Multiply (7.9) using (2.17). Keep $y - X\hat\beta$ together and $X\hat\beta - Xb$ together. Factor $X$ out of $X\hat\beta - Xb$.


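7.3 In (7.12), we obtain

\[ \hat\beta_1 = \frac{n\sum_i x_i y_i - (\sum_i x_i)(\sum_i y_i)}{n\sum_i x_i^2 - (\sum_i x_i)^2} = \frac{n[\sum_i x_i y_i - (n\bar x)(n\bar y)/n]}{n[\sum_i x_i^2 - (n\bar x)^2/n]} = \frac{\sum_i x_i y_i - n\bar x\bar y}{\sum_i x_i^2 - n\bar x^2}, \]

which is (6.5). For $\hat\beta_0$, we start with (6.6):

\[ \hat\beta_0 = \bar y - \hat\beta_1\bar x = \frac{\sum_i y_i}{n} - \Big(\frac{\sum_i x_i y_i - n\bar x\bar y}{\sum_i x_i^2 - n\bar x^2}\Big)\frac{\sum_i x_i}{n} \]
\[ = \frac{(\sum_i y_i)(\sum_i x_i^2 - n\bar x^2) - (\sum_i x_i y_i - n\bar x\bar y)(\sum_i x_i)}{n(\sum_i x_i^2 - n\bar x^2)} \]
\[ = \frac{(\sum_i y_i)(\sum_i x_i^2) - (n\bar y)(n\bar x^2) - (\sum_i x_i y_i)(\sum_i x_i) + (n\bar x\bar y)(n\bar x)}{n(\sum_i x_i^2 - n\bar x^2)}. \]

The second and fourth terms in the numerator add to 0.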


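7.5 Starting with (6.10), we have

\[ \mathrm{var}(\hat\beta_0) = \sigma^2\Big[\frac1n + \frac{\bar x^2}{\sum_i (x_i - \bar x)^2}\Big] = \sigma^2\,\frac{\sum_i (x_i - \bar x)^2 + n\bar x^2}{n\sum_i (x_i - \bar x)^2} = \sigma^2\,\frac{\sum_i x_i^2 - n\bar x^2 + n\bar x^2}{n\sum_i (x_i - \bar x)^2} = \frac{\sigma^2\sum_i x_i^2}{n\sum_i (x_i - \bar x)^2}. \]

7.6 The two terms missing in (7.17) are

\[ [A - (X'X)^{-1}X'][(X'X)^{-1}X']' + [(X'X)^{-1}X'][A - (X'X)^{-1}X']'. \]

Using $AX = I$, the first of these becomes

\[ AX(X'X)^{-1} - (X'X)^{-1}X'X(X'X)^{-1} = (X'X)^{-1} - (X'X)^{-1} = O. \]

7.7 (a) For the linear estimator $c'y$ to be unbiased for all possible $\beta$, we have $E(c'y) = c'X\beta = a'\beta$, which requires $c'X = a'$. To express $\mathrm{var}(c'y)$ in terms of $\mathrm{var}(a'\hat\beta) = \mathrm{var}[a'(X'X)^{-1}X'y]$, we write $\mathrm{var}(c'y) = \sigma^2 c'c = \sigma^2[c - X(X'X)^{-1}a + X(X'X)^{-1}a]'[c - X(X'X)^{-1}a + X(X'X)^{-1}a]$. Show that with $c'X = a'$, this becomes $\sigma^2\{[c - X(X'X)^{-1}a]'[c - X(X'X)^{-1}a] + a'(X'X)^{-1}a\}$, which is minimized by $c = X(X'X)^{-1}a$.

(b) To minimize $\mathrm{var}(c'y)$ subject to $c'X = a'$, we differentiate $v = \sigma^2 c'c - (c'X - a')\lambda$ with respect to $c$ and $\lambda$ (see Section 2.14.3):

\[ \partial v/\partial\lambda = -X'c + a = 0 \quad\text{gives } a = X'c, \]
\[ \partial v/\partial c = 2\sigma^2 c - X\lambda = 0 \quad\text{gives } c = X\lambda/2\sigma^2. \]

Substituting $c = X\lambda/2\sigma^2$ into $a = X'c$ gives $a = X'X\lambda/2\sigma^2$, or $\lambda = 2\sigma^2(X'X)^{-1}a$. Thus $c = X\lambda/2\sigma^2 = X(X'X)^{-1}a$.

7.9
\[ \hat\beta_z = (Z'Z)^{-1}Z'y = (H'X'XH)^{-1}H'X'y = H^{-1}(X'X)^{-1}(H')^{-1}H'X'y = H^{-1}(X'X)^{-1}X'y = H^{-1}\hat\beta. \]

For the $i$th row of $Z = XH$, we have $z_i' = x_i'H$, or $z_i = H'x_i$. Thus in general, $z = H'x$, and

\[ \hat y = \hat\beta_z'z = (H^{-1}\hat\beta)'H'x = \hat\beta'(H')^{-1}H'x = \hat\beta'x. \]

7.10 Since $\hat\beta'x = x'\hat\beta$ is invariant to changes of scale on the x's, $x_i'\hat\beta$ is invariant, where $x_i'$ is the $i$th row of $X$. Therefore, $X\hat\beta$ is invariant, and it follows that $s^2 = (y - X\hat\beta)'(y - X\hat\beta)/(n - k - 1)$ is invariant.

7.11 $(y - X\hat\beta)'(y - X\hat\beta) = y'y - y'X\hat\beta - \hat\beta'X'y + \hat\beta'X'X\hat\beta$. Use (7.8).

7.12 By (7.8), $\hat\beta'(X'y) = \hat\beta'(X'X\hat\beta)$. By Theorem 5.2a, $E(y'y) = E(y'Iy) = \mathrm{tr}(I\sigma^2 I) + E(y')IE(y) = n\sigma^2 + \beta'X'X\beta$. By Theorems 7.3b and 7.3c, $E(\hat\beta) = \beta$ and $\mathrm{cov}(\hat\beta) = \sigma^2(X'X)^{-1}$. Thus $E(\hat\beta'X'X\hat\beta) = \mathrm{tr}[(X'X)\sigma^2(X'X)^{-1}] + \beta'X'X\beta$.

7.13 Let $X_1$ and $\hat\beta_1$ represent a reduced model with $k - 1$ x's, and let $X$ and $\hat\beta$ represent the full model with $k$ x's. Then show that SSE for the full model can be expressed as

\[ \mathrm{SSE} = (y'y - \hat\beta_1'X_1'y) - (\hat\beta'X'y - \hat\beta_1'X_1'y) = \mathrm{SSE}_{k-1} - (\text{a positive term}). \]

It is shown in Theorem 8.2d and Problem 8.10 that $\hat\beta'X'y - \hat\beta_1'X_1'y$ is a positive definite quadratic form.

7.15 First show that $(1/n)j'X_1 = \bar x'$, where $\bar x = (\bar x_1, \bar x_2, \ldots, \bar x_k)'$, which contains the means of the columns of $X_1$. Then

\[ \Big(I - \frac1n J\Big)X_1 = X_1 - \frac1n JX_1 = X_1 - \frac1n jj'X_1 = X_1 - j\bar x' \]
\[ = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1k} \\ x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nk} \end{pmatrix} - \begin{pmatrix} \bar x_1 & \bar x_2 & \cdots & \bar x_k \\ \bar x_1 & \bar x_2 & \cdots & \bar x_k \\ \vdots & \vdots & & \vdots \\ \bar x_1 & \bar x_2 & \cdots & \bar x_k \end{pmatrix}. \]

7.16 By a comment following (2.25), $j'X_c$ contains the column sums of $X_c$. The sum of the second column, for example, is $\sum_{i=1}^n (x_{i2} - \bar x_2) = \sum_{i=1}^n x_{i2} - n\bar x_2 = n\bar x_2 - n\bar x_2 = 0$. Alternatively, $j'X_c = j'[I - (1/n)J]X_1 = [j' - (1/n)j'jj']X_1 = (j' - j')X_1 = 0'$ since $j'j = n$.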


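7.17 (a) Partition $X$ and $\hat\beta$ as $X = (j, X_1)$ and $\hat\beta = \binom{\hat\beta_0}{\hat\beta_1}$. Then show that the normal equations $X'X\hat\beta = X'y$ in (7.8) become

\[ \begin{pmatrix} n & j'X_1 \\ X_1'j & X_1'X_1 \end{pmatrix}\begin{pmatrix} \hat\beta_0 \\ \hat\beta_1 \end{pmatrix} = \begin{pmatrix} n\bar y \\ X_1'y \end{pmatrix}, \]

from which we obtain

\[ n\hat\beta_0 + j'X_1\hat\beta_1 = n\bar y, \qquad (1) \]
\[ X_1'j\hat\beta_0 + X_1'X_1\hat\beta_1 = X_1'y. \qquad (2) \]

Show that (1) becomes $\hat\beta_0 + \bar x'\hat\beta_1 = \bar y$, or $\hat\alpha = \bar y$. Show that (2) becomes

\[ n\bar x\hat\beta_0 + X_1'X_1\hat\beta_1 = X_1'y. \qquad (3) \]

By (7.33), $X_c = [I - (1/n)J]X_1$. Show that $X_c'X_c = X_1'X_1 - (1/n)X_1'JX_1 = X_1'X_1 - n\bar x\bar x'$. Similarly, show that $X_c'y = X_1'y - (1/n)X_1'Jy = X_1'y - n\bar y\bar x$. Now show that the normal equations in (7.34) for the centered model can be written in the form

\[ \begin{pmatrix} n & 0' \\ 0 & X_c'X_c \end{pmatrix}\begin{pmatrix} \hat\alpha \\ \hat\beta_1 \end{pmatrix} = \begin{pmatrix} n\bar y \\ X_c'y \end{pmatrix}, \]

which becomes

\[ n\hat\alpha = n\bar y, \qquad (4) \]
\[ X_c'X_c\hat\beta_1 = X_c'y. \qquad (5) \]

Thus (4) is the same as (1). Using $\bar x'\hat\beta_1 = \hat\alpha - \hat\beta_0$, show that (5) is the same as (3).

(b) Using (2.50) with $A_{11} = n$, $A_{12} = n\bar x'$, $A_{21} = n\bar x$, $A_{22} = X_1'X_1$, and $X_c'X_c = X_1'X_1 - n\bar x\bar x'$, show that

\[ (X'X)^{-1} = \begin{pmatrix} n & j'X_1 \\ X_1'j & X_1'X_1 \end{pmatrix}^{-1} = \begin{pmatrix} n & n\bar x' \\ n\bar x & X_1'X_1 \end{pmatrix}^{-1} = \begin{pmatrix} \dfrac1n + \bar x'(X_c'X_c)^{-1}\bar x & -\bar x'(X_c'X_c)^{-1} \\ -(X_c'X_c)^{-1}\bar x & (X_c'X_c)^{-1} \end{pmatrix} \]

and verify by multiplication that $(X'X)^{-1}X'X = I$. With this partitioned form of $(X'X)^{-1}$, show that

\[ \hat\beta = (X'X)^{-1}X'y = (X'X)^{-1}\begin{pmatrix} n\bar y \\ X_1'y \end{pmatrix} = \begin{pmatrix} \bar y - \bar x'(X_c'X_c)^{-1}X_c'y \\ (X_c'X_c)^{-1}X_c'y \end{pmatrix} = \begin{pmatrix} \bar y - \hat\beta_1'\bar x \\ (X_c'X_c)^{-1}X_c'y \end{pmatrix}, \]

which is the same as (7.37) and (7.38).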


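7.18 Substitute $x_1 = \bar x_1, x_2 = \bar x_2, \ldots, x_k = \bar x_k$ in $\hat y = \bar y + \hat\beta_1(x_1 - \bar x_1) + \cdots + \hat\beta_k(x_k - \bar x_k)$ to obtain $\hat y = \bar y$.

7.19
\[ y'y - \hat\beta'X'y = y'y - (\hat\beta_0, \hat\beta_1')(j, X_1)'y = y'y - (\hat\beta_0, \hat\beta_1')\begin{pmatrix} j'y \\ X_1'y \end{pmatrix} \]
\[ = y'y - \hat\beta_0 n\bar y - \hat\beta_1'X_1'y \]
\[ = y'y - (\bar y - \hat\beta_1'\bar x)n\bar y - \hat\beta_1'X_1'y \]
\[ = y'y - n\bar y^2 - \hat\beta_1'(X_1'y - n\bar y\bar x) \]
\[ = \sum_{i=1}^n (y_i - \bar y)^2 - \hat\beta_1'X_c'y. \]

7.20 (a) By Theorem 2.2c(i), $X_c'X_c$ is obtained as products of columns of $X_c$. By (7.33), these products are of the form illustrated in the numerators of (7.41) and (7.42).

(b) By (7.43), the numerator of the second element of $s_{yx}$ is $\sum_{i=1}^n (x_{i2} - \bar x_2)(y_i - \bar y)$. This can be written as $\sum_i (x_{i2} - \bar x_2)y_i - \sum_i (x_{i2} - \bar x_2)\bar y$, the second term of which vanishes. Note that $\sum_i (x_{i2} - \bar x_2)y_i$ is the second element of $X_c'y$.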


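7.21 (b) Expand the last term of $\ln L(\beta, \sigma^2)$ in (7.51) to obtain

\[ \ln L(\beta, \sigma^2) = -\frac n2\ln(2\pi) - \frac n2\ln\sigma^2 - \frac{1}{2\sigma^2}(y'y - 2y'X\beta + \beta'X'X\beta). \]

Then

\[ \frac{\partial \ln L(\beta, \sigma^2)}{\partial\beta} = 0 - 0 - \frac{1}{2\sigma^2}(0 - 2X'y + 2X'X\beta). \]

Setting this equal to 0 gives (7.48).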




7.22 


7.26 


7.27 




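(c) Use $\ln L(\beta, \sigma^2)$ as in (7.51), to obtain

\[ \frac{\partial \ln L(\beta, \sigma^2)}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}(y - X\beta)'(y - X\beta). \]

Setting this equal to 0 (and substituting $\hat\beta$ from $\partial \ln L/\partial\beta = 0$) yields (7.49).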


(ii) By (7.26), $\mathrm{SSE} = y'[I - X(X'X)^{-1}X']y$. Show that $I - X(X'X)^{-1}X'$ is idempotent of rank $n - k - 1$, given that $X$ is $n \times (k+1)$ of rank $k + 1$. Then by Corollary 2 to Theorem 5.5a, $\mathrm{SSE}/\sigma^2$ is $\chi^2(n - k - 1, \lambda)$, where $\lambda = \mu'A\mu/2\sigma^2 = (X\beta)'[I - X(X'X)^{-1}X'](X\beta)/2\sigma^2$. Show that $\lambda = 0$.

(iii) Show that $(X'X)^{-1}X'[I - X(X'X)^{-1}X'] = O$. Then by Corollary 1 to Theorem 5.6a, $\hat\beta = (X'X)^{-1}X'y$ and $\mathrm{SSE} = y'[I - X(X'X)^{-1}X']y$ are independent.


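The two missing terms in (7.52) are

\[ -(y - X\hat\beta)'X(\hat\beta - \beta) - (\hat\beta - \beta)'X'(y - X\hat\beta) \]
\[ = -(X'y - X'X\hat\beta)'(\hat\beta - \beta) - (\hat\beta - \beta)'(X'y - X'X\hat\beta) \]
\[ = -0'(\hat\beta - \beta) - (\hat\beta - \beta)'0. \]

Note that $X'y - X'X\hat\beta = 0$ by the normal equations $X'X\hat\beta = X'y$ in (7.8).

\[ \hat\beta'X'y = (\hat\beta_0, \hat\beta_1')(j, X_1)'y = (\hat\beta_0, \hat\beta_1')\begin{pmatrix} j' \\ X_1' \end{pmatrix}y = n\hat\beta_0\bar y + \hat\beta_1'X_1'y. \]

With $\hat\beta_0 = \bar y - \hat\beta_1'\bar x$ from (7.38) and $X_c = [I - (1/n)J]X_1$ from (7.33), this becomes

\[ \hat\beta'X'y = n(\bar y - \hat\beta_1'\bar x)\bar y + \hat\beta_1'\Big(X_c' + \frac1n X_1'J\Big)y = n\bar y^2 - n\hat\beta_1'\bar x\bar y + \hat\beta_1'X_c'y + \frac1n\hat\beta_1'X_1'Jy. \]

The last term can be written as $(1/n)\hat\beta_1'X_1'Jy = (1/n)\hat\beta_1'X_1'jj'y = (1/n)\hat\beta_1'(n\bar x)(n\bar y) = n\hat\beta_1'\bar x\bar y$, so that $\hat\beta'X'y = n\bar y^2 + \hat\beta_1'X_c'y$.

If $\hat\beta_1 = \hat\beta_2 = \cdots = \hat\beta_k = 0$, then $\hat\beta_1 = 0$ and $\hat\beta_1'X_c'X_c\hat\beta_1 = 0$. If $y_i = \hat y_i$, $i = 1, 2, \ldots, n$, then $y = \hat y = X\hat\beta$ and $\hat\beta'X'y - n\bar y^2 = y'y - n\bar y^2$. Also see formulas below (7.61).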


This follows from the statement following Theorem 7.3f, which notes that an 
additional x reduces SSE (see Problem 7.13). 


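(a) A set of full-rank linear transformations on the x's can be represented by $W = XH$, where $H$ is a nonsingular matrix. Show that $\hat\beta_w = (W'W)^{-1}W'y = H^{-1}(X'X)^{-1}X'y = H^{-1}\hat\beta$. Show that $\hat\beta_w'W'y = \hat\beta'X'y$. Then

\[ R_w^2 = \frac{\hat\beta_w'W'y - n\bar y^2}{y'y - n\bar y^2} = \frac{\hat\beta'X'y - n\bar y^2}{y'y - n\bar y^2} = R^2. \]

(b) Replacing $y$ by $z = cy$, we have $\bar z = (1/n)j'z = (1/n)j'cy = c\bar y$ and $\hat\beta_z = (X'X)^{-1}X'z = (X'X)^{-1}X'cy = c\hat\beta$. Then

\[ R_z^2 = \frac{\hat\beta_z'X'z - n\bar z^2}{z'z - n\bar z^2} = \frac{c\hat\beta'X'cy - n(c\bar y)^2}{(cy)'(cy) - n(c\bar y)^2} = \frac{c^2(\hat\beta'X'y - n\bar y^2)}{c^2(y'y - n\bar y^2)} = R^2. \]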
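7.30
\[ \sum_{i=1}^n \hat y_i/n = \frac{j'\hat y}{n} = \frac{j'X\hat\beta}{n} = \frac{n\hat\beta_0 + j'X_1\hat\beta_1}{n} = \hat\beta_0 + \bar x'\hat\beta_1 = \hat\beta_0 + (\bar y - \hat\beta_0), \]
by (7.38).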
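7.31 By (7.61), we obtain

\[ \cos^2\theta = \frac{(\hat y - \bar y j)'(\hat y - \bar y j)}{(y - \bar y j)'(y - \bar y j)} = \frac{\hat y'\hat y - \bar y\hat y'j - \bar y j'\hat y + \bar y^2 j'j}{\sum_i (y_i - \bar y)^2} = \frac{\hat\beta'X'X\hat\beta - n\bar y^2}{\sum_i (y_i - \bar y)^2}, \]

since $j'\hat y = n\bar y$ by Problem 7.30. By (7.8), $\hat\beta'X'X\hat\beta = \hat\beta'X'y$.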


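7.33 (a) Using $\hat\beta = (X'V^{-1}X)^{-1}X'V^{-1}y$, expand $(y - X\hat\beta)'V^{-1}(y - X\hat\beta)$ to obtain $y'V^{-1}y - y'V^{-1}X(X'V^{-1}X)^{-1}X'V^{-1}y$, the second term of which appears twice more with opposite signs.

(b) Use Theorem 5.2a with $A = V^{-1} - V^{-1}X(X'V^{-1}X)^{-1}X'V^{-1}$, $\Sigma = \sigma^2 V$, and $\mu = X\beta$.

7.34
\[ \ln L(\beta, \sigma^2) = -\frac n2\ln(2\pi) - \frac n2\ln\sigma^2 - \frac12\ln|V| - \frac{1}{2\sigma^2}(y - X\beta)'V^{-1}(y - X\beta). \]

Expand the last term to obtain

\[ -\frac{1}{2\sigma^2}(y'V^{-1}y - y'V^{-1}X\beta - \beta'X'V^{-1}y + \beta'X'V^{-1}X\beta). \]

Differentiate to obtain

\[ \frac{\partial \ln L(\beta, \sigma^2)}{\partial\beta} = 0 - 0 - 0 - \frac{1}{2\sigma^2}(0 - 2X'V^{-1}y + 2X'V^{-1}X\beta), \]
\[ \frac{\partial \ln L(\beta, \sigma^2)}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}(y - X\beta)'V^{-1}(y - X\beta). \]

Setting these equal to 0 and 0, respectively, gives the results.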




7.35 Show that $J^2 = nJ$. Then multiply $V$ by $V^{-1}$ to get $I$, where $V$ and $V^{-1}$ are given by (7.67) and (7.68), respectively.

7.36 (a) $j'V^{-1}j = aj'(I - b\rho J)j = aj'j - ab\rho j'jj'j = an - ab\rho n^2 = an(1 - b\rho n)$. Substitute for $a$ and $b$ to show that this is equal to $n/[1 + (n-1)\rho] = bn$. Then $j'V^{-1}X_c = aj'(I - b\rho J)X_c = aj'X_c - ab\rho j'jj'X_c = 0'$ because $j'X_c = 0'$. Show that $X_c'V^{-1}X_c = aX_c'X_c$.

7.37 $\mathrm{cov}(\hat\beta^*) = (X'X)^{-1}X'\,\mathrm{cov}(y)\,X(X'X)^{-1} = \sigma^2(X'X)^{-1}X'VX(X'X)^{-1}$.


1 Vi 
ee 2 ix, 
7.38 (a) (X'V-'X) 'x’'vly= Xi Mi 
no YO; ii 
iti Nn JE 
1 ix, 


> ; (xi — a) 1 =<) 
7.4 = : ; 
0 (a) var( Ga se ay ” (x; — X)°var(y) 


XP ox. 


1 
Do, i — x7) » 


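7.42 $\mathrm{cov}(\hat\beta_1^*) = E\{[\hat\beta_1^* - E(\hat\beta_1^*)][\hat\beta_1^* - E(\hat\beta_1^*)]'\}$. Using $E(\hat\beta_1^*) = \beta_1 + A\beta_2$ from (7.80), we have

\[ \hat\beta_1^* - E(\hat\beta_1^*) = \hat\beta_1^* - \beta_1 - (X_1'X_1)^{-1}X_1'X_2\beta_2 \]
\[ = (X_1'X_1)^{-1}X_1'y - (X_1'X_1)^{-1}X_1'X_2\beta_2 - \beta_1 \]
\[ = (X_1'X_1)^{-1}X_1'(y - X_2\beta_2) - \beta_1. \]

Show that this can be written as $(X_1'X_1)^{-1}X_1'(y - X_1\beta_1 - X_2\beta_2)$, so that

\[ \mathrm{cov}(\hat\beta_1^*) = E[(X_1'X_1)^{-1}X_1'(y - X_1\beta_1 - X_2\beta_2)(y - X_1\beta_1 - X_2\beta_2)'X_1(X_1'X_1)^{-1}]. \]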


7.43 Use Theorem 7.9a and note that 


XB = ox .xf( B = Xo By + Xo hb. 


7.44 Multiply out (Xo; — A’Xo1)'Goo(Xo1 — A’Xo1), substitute A = G7 Gio, and 
use (2.50). 


7.45 E(X),B,)= x)\(B; + AB) 4 XB). 


7.46 var(Xo) B;) > var(xo B, ) 
= 0° (x), G'!xo1 — xy, G7 X01) 
= 0x), (G"! — Gy )xo1 
> 0 because G'! — Gy| = AB“'A which is positive definite 
[see Theorem 7.9c(1i)]. 


7.47 tr[f — X,(X,X1)'X)] = e(D — u[X) XXX) |] =n -(p +, 
BX'[E— Xi(XX1)'X)]XB = (BX), + BX4)[T — Xi (XX) Xi] 
(Xi, + X2f,). 


Show that three of the resulting four terms vanish, leaving the desired result. 
7.48 OME Oi= Bix)” _ 9 
Op, 
25° Oi Bim) - x) = 0. 


7.49 For the full model y; = By + B,x; + €&;, we have 




7.50 




For the reduced model y; 
X,=(,1,.. 


Bi xi + e}, we have X; = (11,2, ... 
., 1)’. Then from (7.80), we obtain 


E(B,) = B, + AB) = By + (X,X1) X, XB, 


(a) 


~ 
| 
See eRe ee 


j=) 
ohre Or £O 


n 1, 
By + (x) De: * Bo. 
i=1 i=1 


—27 
—8 
—1 

0 
1 
8 
27 


, Xn)’. Thus, 


The first two columns constitute X,, and the last two columns become Xp. 


Then by (7.80), we obtain 


E(B,) = By + (X,X1)'X,Xopp. 


Show that this gives 


Bo 
By 
Bo 
By 


BB) = ( 


- 


) 
) 


€ 
0 2 


ee 
0 7 


0 
8 


)G 
)| 


28 0 
196 


By 
Bs 


Ms.) 


Bo 
Bs 


i: 


so that E(B) = By + 4B, and E(B) = B, + 7P3. 


7.51 X'X>) =X) [KX — Xi(X/X1)-!X).Xo] = XX. — XX (XX) |XX. 


7.52 In the partitioned form, the normal equations x’'xB = X’y become 


(32 Jos 20 


x! X 
xX, 


X! X 
Xx), X> 


) 


B 
B 


( 


y 


Bi, 
B 


2: 
Xby ° 


Xx) XB, +X XB, = Xiy, 
X) Xi B, + X5 XB, = Xy. 


() 
(2) 




Solve for B, from (1) to obtain B, = (X,X:)"!(X/y — X/X2,), and 
substitute this into (2) to obtain 


[X,.X_ — X5X1(X).X1)!X) Xo] Bo = Xby — X}Xi(X, Xi) Xiy. 3) 


Multiplying (7.98) by X54, we obtain B, = (X} .Xo1) 1X) [9 — p(X). 
Show that this is the same as (3). 


1.0150 
—0.0286 

0.2158 |, s? = 7.4529. 
—4.3201 

8.9749 


&> 
I 


7.53 (a) 


3.4645 0145 —.0638 —1.1620 1.0723 
0.0145  .0082 —.0019 —0.1630 0.0784 
(b) .2(x’x)-! = | —0.0638 -—.0019  .0046 0.1039 —0.1250 
1.1620 —.1630 1039 8.1280 —7.2045 
1.0723 .0784 —.1250 —7.2045 7.6875 


380.6684 237.6684 27.0709 25.3549\ ~' 


237.6684 247.5071 17.8557 18.3362 
27.0709 = 17.8557 = 2.1090 ~—:1.9909 
25.3549 =—-18.3362 11.9909 = 1.9369 


©) B = So Sy = 


151.0121 —0.0286 
134.0444 0.2158 
11.8365 | | —4.3201 
12.0140 8.9749 


By = ¥ — BX = 31.125 — ( — 0.0286, 0.2158, — 4.3201, 8.9749) 


57.9063 

55.9063 
= 1.0150. 

4.4222 

4.3238 


(dd) R?=.9261, R?=.9151. 




332.111 
p —1.546 
7.54 (a) B= | a5 |- s° = 5.3449, 


—2.237 


(b) cov(B) = s?(X’X)7! 

65.37550 —.33885 —.31252 —.02041 
—0.33885  .00184 00127 —.00043 
—0.31252  .00127. .00408 +—.00176 
—0.02041 -—.00043 —.00176  .02161 


= 5.3449 


a oo 
(c) R= .9551, Ri = .9462. 


964.929 
—7.442 
—11.508 
—2.140 
(a) B= ee , 82 = 5.1342. 
—0.294 
0.054 
0.038 
—0.102 


(e) R° =.9741, R? = 9483. 


6628 
a 7803 | 9 | 
7.55 (a) B= 5031 | & = 87.9969. 
—17.1002 

504.2783 9.4698 —1.7936 \~' / 428.9086 
(b) B, =Sq'sx =| 9.4698 201.9399 1.0617 90.8333 
~1.7936 1.0617 0.0235 —1.2667 

0.7803 

=| 0.5031 

~17.1002 


By =¥ — B,xX = 41.1553 — (.7803, .5031, — 17.1002) 


42.945 
x 20.169 | = .6628. 
0.185 


(c) R° = .8667, R? = .8534. 




Chapter 8 


8.1 


8.2 


8.3 


8.4 


8.6 


8.7 


Substitute B, = (X/X.)~!X/y into SSR = B,X‘y. 
1 1 

@ 4. (1 a ~3) =H, —-H.J 
n n 


1 

= X,(X).X,) 1X! — = X(XX,) Xx y’ 
n 

= X,(X’X,) 'x’ —O 


since X/ jj’ = Oj’ = O. 
(b) Show that H? = H., where H, = X.(X'.X,)'X’. Then, since H., is 
idempotent, rank (H,) = tr(H,.) by Theorem 2.13d. The centered 


matrix X. is nk of rank k [see (7.33)]. 


2 2 
”) (1-15-1,) (1 ‘3) (1 3) H, u.(1 i) +10 
n n n n : 


1 
=I-——J-—H,—-—H,.+ He. 
n 


Then rank (1-15-1.) =u(1-ts-n,), 
n n 


1 1 
i w.(1-75-H,) =u. (1-75) —~H? =H, —-H, =0. 
n n 


WH. = BX'X,(X.X,) |X XB. By (7.32), we have XB = aj + X.B,. 
Hence p'H.p = (aj! + B,X!)X-(X.X.)'X'(aj + X.B,). Three of the 
resulting four terms vanish because j/X, = 0’ (see Problem 7.16). 


By corollary 2 to Theorem 5.5a, SSE/o7 is x7(n —k —1,A2). Also Az = 
1 1 
mL ——J—H.]a/o° = (ai! + B,X)[I ——J — H.|(@j + X-B,)/o°. Show 


1 
that all terms involving either j’/[I——J] or j/H. vanish. Show that 
n 
1 
B,X. [I ——J]X-B, = B)X.X-B, and that B,X,H-X-B, = B)X,X-A. 
n 


Most of these results are proved in Problem 5.32, with the adjustment k+ 
1=p. 
By (8.14), HH, = X(X'X)"!X’X,(X{X,) 'X) = X\(X{X1) |X) = Mh. 




8.9 


8.11 


8.12 
8.13 


8.14 


8.15 


8.16 


8.17 
8.18 




»'(H — Hyp = B'X'(H— H))XB 
= p’X'X(X'X) | X/XB — B’'X'X\(X/X,) |X| XB 
= p'X'XB — p’X'X, (XX) |X| XB 
= (BX) + B)X4)(XiB; + XB) 
— (BX), + ByX4)Xi (XX) 1X} (Xi By + X2B)). 


Denote the matrix X’X by G. Then in partitioned form, we have 


x’ 
G = X'X = (Xj, Xs)(Xi, X2) = ( os X») 


1 
/ 
X; 


-_ Ge oo) = ie eS) 

 AXSX) XX) (G2 Gy /’ 
If we denote the four corresponding blocks of G-' by G", then by (2.48), 
G7? =(Gs=GiGi Gs) | By Theorem 2.6e, G ' is positive definite. 


By Theorem 2.6f, G~ is positive definite. By Theorem 2.6e, (G**) |= 
Gyo — GG 1)'G)> = X$5X> — X5X;(XX;) 'XX is positive definite. 


By Theorem 8.2b(ii), SS(B5|B,)/o7 is x7(h, A1). Then E[SS(B,|B,)/o7] = 
h+ 2A, by (5.23). 

oO + Be[xpxe — XX (XX1) Xp xx]. 

For the reduced model y = Bij + €*, we have Bi = (ij) | j/'y = 

(1/n) 307-1 yi = ¥ and SS(B5) = Boj'y =F DV. yi = ny. 

After multiplying to obtain eight terms, three of the first four terms cancel 
three of the last four terms. For example, the second of the last four is 
BXX1A By = BY XX (XX)! XX Bs = BLXX2o, which is the same 
as the second of the first four terms. 

Add and subtract $n\bar y^2$ in both numerator and denominator of (8.24) and then divide numerator and denominator by $y'y - n\bar y^2$.


Express SSH as a quadratic form in y by substituting B = (X’'X)"!X’y. Then 
use Corollary 1 to Theorem 5.6b. 


This follows from Corollary 1 to Theorem 2.6b. 


By the answer to Problem 7.28, B/,W'y = B’X'y. Thus SSE = y’y — B’X’y is 
invariant to the full-rank transformation W = XH. For the numerator of 
(8.27), we note that C is transformed the same way as is X, so 


that C,.B,, = CHH'£. = CB . Thus the numerator of (8.27) becomes 


(CwB,,)[Co(W'W)'C,,] "CB, = (CB {CH[(XH)XH]'(CH)'}“'CB 
= (CB) {CH[H'(X'X)H] 'H'C’}~“'cp 




= (CB) [(CHH"!(X’x)"'(H’)"'H'C]'CB 
= (CB) [C(X’X)'C’)'CB. 


Show that the transformation z = cy also leaves F unchanged. 


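8.19 (a) See Section 2.14.3. $\partial u/\partial\lambda = C\beta$. Setting this equal to 0 gives $C\hat\beta_c = 0$.

(b)
\[ u = y'y - y'X\beta - \beta'X'y + \beta'X'X\beta + \lambda'C\beta = y'y - 2\beta'X'y + \beta'X'X\beta + \lambda'C\beta, \]
\[ \frac{\partial u}{\partial\beta} = 0 - 2X'y + 2X'X\beta + C'\lambda. \]

Setting this equal to 0 gives

\[ \hat\beta_c = (X'X)^{-1}X'y - \tfrac12(X'X)^{-1}C'\lambda = \hat\beta - \tfrac12(X'X)^{-1}C'\lambda. \qquad (1) \]

(c) $C\hat\beta_c = C\hat\beta - \tfrac12 C(X'X)^{-1}C'\lambda = 0$, so that

\[ \lambda = 2[C(X'X)^{-1}C']^{-1}C\hat\beta. \]

Substituting this into (1) in part (b) gives the result.

8.20
\[ \hat\beta_c'X'X\hat\beta_c = \hat\beta_c'X'X\{\hat\beta - (X'X)^{-1}C'[C(X'X)^{-1}C']^{-1}C\hat\beta\} \]
\[ = \hat\beta_c'X'X\hat\beta - \hat\beta_c'C'[C(X'X)^{-1}C']^{-1}C\hat\beta \]
\[ = \hat\beta_c'X'y - 0'[C(X'X)^{-1}C']^{-1}C\hat\beta. \]

Show that $\hat\beta_c'C' = 0'$.

8.21 Substituting $\hat\beta_c$ in (8.30) into $\mathrm{SSH} = \hat\beta'X'y - \hat\beta_c'X'y$ in (8.31) gives

\[ \mathrm{SSH} = \hat\beta'X'y - \{\hat\beta' - \hat\beta'C'[C(X'X)^{-1}C']^{-1}C(X'X)^{-1}\}X'y \]
\[ = \hat\beta'X'y - \hat\beta'X'y + \hat\beta'C'[C(X'X)^{-1}C']^{-1}C\hat\beta, \]

since $\hat\beta = (X'X)^{-1}X'y$.

8.22 In Theorem 8.4e(ii), we have

\[ \mathrm{cov}(\hat\beta_c) = \mathrm{cov}\{[I - (X'X)^{-1}C'[C(X'X)^{-1}C']^{-1}C]\hat\beta\} = \mathrm{cov}(A\hat\beta) = A\,\mathrm{cov}(\hat\beta)A' = \sigma^2 A(X'X)^{-1}A'. \]

Show that $A(X'X)^{-1}A' = (X'X)^{-1} - (X'X)^{-1}C'[C(X'X)^{-1}C']^{-1}C(X'X)^{-1}$.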




8.23 


8.24 


8.25 




Replace B by (X’X) 'X’y in SSH in Theorem 8.4d(ii) to obtain SSH = 
[C(X’X)'X’y — t)[C(X’XK)'C’]'[C(X'X) 'X’y — t]. = Show that 
C(X’X)_'X’'y — t= C(X’X) 'X’[y—XC(CC’) 't], so that SSH becomes 
SSH=[y—XC'/(C’'C) 't]/A[y-—XC(CC’) 't], where A= X(X’X) 'C’ 
[C(X’X)'C’]" 'C(X’X)'X’. Show that SSE=[y—XC(CC) 't//Bly— 
XC'(CC’) 't], where B = I—X(X’X)_'X’. Show that AB = O. Show that 
y—-XC'(CC’) 't is N,[XB—XC(CC’) 't, oI]. Then by Corollary 1 to 
Theorem 5.6b, SSH and SSE are independent. 


See Section 2.14.3. Follow the steps in Problems 8.19 using 


u=(y — XB)'(y — XB) + A(CB- b 
= yy — 2B'Xy + B’X’XB + A(CB - 6). 


Differentiating with respect to A and B, we obtain 


Ou 


OA Cp -t, 
aes 0 — 2X'y + 2X’XB+C'A 
Op : 


Setting those equal to 0 gives CB. = tand 
B. = B-YX'XY ICA. (1) 


Multiplying (1) by C and _ using CB. =t gives A=2 [C(X'X)! 
C’ V (CB-t. Substituting this into (1) gives the result. 


By Theorem 8.4d, we can use the general linear hypothesis test. Use a’= 
(0, ...,0,1) in place of C in (8.30) to obtain 


B. = B- (XX) ala’(X’X) la} 'a’B 


oy (X'X)"laa’B 
Skk 


By (2.37), (X’ >. 6 ie ais a linear combination of the columns of (X’X)~ ' Thus 


a. — p— 22, 
8kk 


where gj; is the kth diagonal element of (X’ x)! and g, is the kth column of 


8.26 


8.27 


8.28 


8.29 


8.30 




(X’X) |. Substituting this expression for B. into px’'y — B.X’y, we obtain 


, a} “y a} B 
BX'y —- B.X'y = BX'y - (2 X’y - a Xx) 


kk 


A a2 
_ BB _ Bi 
Skk Sik 


since g, X’y is the kth element of B. 
a’ B—a'B 
P| la)22-k-1 <= 
sy/al(X’X)!a 


Solve the inequality for a’. 


> la/2n—k J=1-e 


In the answer to Problem 7.17b, we have 


1 = x’ X -ls =I. x’ x —l 
ni ae x; ( C c) xX] —x; ( C a) 
—(X,.X.)'X) (XiX.)! 


(xX)! = 
where X; = (X1,%2, ... ,X,)'. Using this form of (X’ y= show that 
1 
X)(X'X)'xo = (1, xp, )(X’X)! ( ) 
Xo1 
1 = \vy!l -1 = 
=— + (X01 — X1) (XX) (X01 — 1). 


In this case, X} —X;=x9—* and 


Xx} —X 

Xz —X 
—— 

Xn — X 


$E(y_0 - \hat y_0) = E(y_0 - x_0'\hat\beta) = x_0'\beta - x_0'\beta = 0$. By (8.59), $\mathrm{var}(y_0 - \hat y_0) = \sigma^2[1 + x_0'(X'X)^{-1}x_0]$. Therefore, $(y_0 - \hat y_0)/\sigma\sqrt{1 + x_0'(X'X)^{-1}x_0}$ is $N(0, 1)$ by Theorems 7.6b(i) and 4.4a(i). By Theorem 7.6b(ii), $(n - k - 1)s^2/\sigma^2$ is $\chi^2(n - k - 1)$. By Theorem 7.6b(iii), $\hat y_0$ and $s^2$ are independent. Use (5.33) to show that $t = (y_0 - \hat y_0)/s\sqrt{1 + x_0'(X'X)^{-1}x_0}$ is distributed as $t(n - k - 1)$.


(a) Show that Ej — yo) = EG - xB) =0 and that var(¥p — Jo) = 
o[1/q+xo(X'X)'xo]. For the remaining steps, follow the answer to 
Problem 8.29. 




8.31 


8.33 




Invert (take the reciprocal of) all three numbers of the inequality (which 
changes the directions of the two inequalities) and multiply by (n — k — 1)s?. 


Let yp = Oo, You, --- od). be the vector of d future observations, and let 


Xo 
Xi = 
Xoa 
be the d x (k + 1) matrix of corresponding values X01, X02, .. . .Xo¢- Show that 


Yo — XuB is Na(0, o2Va), where Vy =1a+ X4(X'X) |X’, and X is the X 
matrix for the original n observations. Show that 


(Yo — XuB)'Vz'(¥o — XuB) 


ao is F(d,n—k—1) 
S 


[for the distribution of the numerator, see (5.27) or Problem 5.12e]. By 
Theorem 8.5 and (8.71) with k+-1 = d, we have the simultaneous intervals 


— sy da'V,!aF adntt Sal(Yo — XaB) < sy/da'V!aF ants 


which hold for all a with confidence coefficient !—a. Setting 
a, = (1, 0,..., 0), ..., a, =(0,..., 0, 1), we obtain 


xB = sy/adll + xh,(X’X) | xoi|/Faan-k-1 


S yor < xpiB+ sy dll + xhyXX) x0] Fadnk-- 


These intervals hold with confidence coefficient at least 1—a. 


For (8.77), we have 


amnLOo) 9) fo 1 yy 
do? do? 


(2102)"/? 
0 n n 
=55| 5 In 2a xin ¥y/20°| 
n yy _ 
ie oe 
Po me 
_ 




For (8.78), we have 


max L( B,o°) = max L(0,07) = L(0,6 6) 


1 
oe Stier’. 
(206 3)"/? 
= ; 1 : eV ¥/20y'y/n) 
(27)? (y'y/n)"! 


n?/2 e-n/2 


—y'y/265 


For (8.79), we have 
n/2 


(y — XB)'(y — XB) 
yy 


n/2 A 
; _ yy — BX’y 
yy —BX’y+ BX’y 
1 
1+ BX'y/y'y — BX’y 


n/2 


8.34 Expanding (y — XB)'(y — XB), we have v = —(n/2) In (277) — (n/2) Ino? — 
[y'y — 2y'XB + B'X’XB]/20° + ACB. Differentiation with respect to 
B gives the result. 


8.35 From (8.80), we obtain 
Bo = (XX) | Xy + GB(X'X) ICA. (1) 


Multiplying Bo by C gives CB) = CB + &2C(X'X)'C’A. By (8.77), 
CBy = 0, and we have 


CB 


h=-[CQX’Xy CTS. 
% 


Substituting this into (1) gives 


By = B— (XX) 'C'[CX’'X) CT 'CB, 


where B = (X’X)7!X’y. 




8.36 Substituting (8.83) into (8.84) gives ' 
(y — XBp)'(y —XBy) = {y-xp+ X(X’X)I1C'[C(X’X)'CT'cp} 
x {y-XB+X(X'xy 'C[CX’xy 'c} 'cp} 
=(y—XB)(y— XB) +0+0+ (CB) [CX’X) CT 'CB. 
Show that the second and third terms vanish and the fourth term is equal to 
(CB) [C(X’X)-'C'}"'CB as indicated. 
8.37 (a)

Source        df    SS           MS          F        p Value
Due to β₁      4    2520.2724    630.0681    84.540   7.216 x 10-5
Error         27     201.2276      7.4529
Total         31    2721.5000

(This p value would typically be reported as p < .0001.) The F value can also be found using (8.23):

\[ F = \frac{R^2/k}{(1 - R^2)/(n - k - 1)} = \frac{.9261/4}{(1 - .9261)/27} \approx 84.5. \]

(b) For the reduced model $y_i = \beta_0^* + \beta_2^* x_{i2} + \beta_4^* x_{i4} + \varepsilon_i^*$, we obtain $\hat\beta_1^{*\prime}X_1'y - n\bar y^2 = 2483.1136$. From the analysis of variance table in part (a), we have $\hat\beta'X'y - n\bar y^2 = 2520.2724$. The difference is $\hat\beta'X'y - \hat\beta_1^{*\prime}X_1'y = 37.1588$. By (8.17), we have

\[ F = \frac{37.1588/2}{7.4529} = 2.4929 \]

with $p = .102$.

(c) The values of $t_j = \hat\beta_j/s\sqrt{g_{jj}}$ in (8.39) are given in the following table:

Variable    β̂_j        s√g_jj    t_j       p Value
x1          −0.0286    .0906     −0.316    .755
x2           0.2158    .0677      3.187    .00362
x3          −4.3201    2.8510    −1.515    .141
x4           8.9749    2.7726     3.237    .00319

Comparing each (two-sided) p value to .05, we would reject $H_0: \beta_j = 0$ for $\beta_2$ and $\beta_4$. Comparing each p value to the Bonferroni value of $.05/4 = .0125$, we reject $H_0$ for $\beta_2$ and $\beta_4$ also.
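
The t statistics and the two rejection rules in part (c) can be reproduced from the tabled values. This is a minimal sketch that uses only the estimates, standard errors, and p values reported above; the dictionary layout and variable names are illustrative choices.

```python
# (estimate, standard error s*sqrt(g_jj)) as reported in the table for part (c).
rows = {
    "x1": (-0.0286, 0.0906),
    "x2": (0.2158, 0.0677),
    "x3": (-4.3201, 2.8510),
    "x4": (8.9749, 2.7726),
}

# Two-sided p values as reported in the table.
p_values = {"x1": 0.755, "x2": 0.00362, "x3": 0.141, "x4": 0.00319}

# t_j = beta_hat_j / (s * sqrt(g_jj))
t_values = {name: est / se for name, (est, se) in rows.items()}

alpha = 0.05
bonferroni = alpha / len(rows)   # .05/4 = .0125

# Variables whose coefficients are declared nonzero under each rule.
rejected_unadjusted = sorted(v for v, p in p_values.items() if p < alpha)
rejected_bonferroni = sorted(v for v, p in p_values.items() if p < bonferroni)
```

Both rules flag the same two coefficients here, which matches the conclusion stated in the answer.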




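(d) To test $H_0: \beta_1 = \beta_2 = 12\beta_3 = 12\beta_4$, we write $H_0: C\beta = 0$, where

\[ C = \begin{pmatrix} 0 & 1 & -1 & 0 & 0 \\ 0 & 0 & 1 & -12 & 0 \\ 0 & 0 & 0 & 1 & -1 \end{pmatrix}. \]

We test $H_0$ using (8.26). For $H_{01}: \beta_1 = \beta_2$, $H_{02}: \beta_2 = 12\beta_3$, and $H_{03}: \beta_3 = \beta_4$, we test each row of $C$ separately using (8.37). For $H_{04}: \beta_1 = \beta_2$ and $\beta_3 = \beta_4$, we use the first and third rows of $C$ and test with (8.26). The results are as follows: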


236.3268 /3 
Boi = 10. _¢. 
0 oI BT ee ee 
26.7486 
Hot, Ep gang oe p = 0.0689 
5p jh NTI pane ee 
Rage aeaagg, p=0. 
43.5851 
Hig Pas p = 0.0226 
206.2962/2 
Ho F =— 73 = 13.84 — 0.000072 
i 74529 3.8400 p = 0.0000729 


(e) For $\nu = 27$, we have $t_{.025,27} = 2.0518$ and $t_{.00625,27} = 2.6763$. Using (8.45) and (8.65) and the values in the answer to part (c), we obtain the following lower and upper confidence limits:

      β̂_j ± t_{.025,27} s√g_jj        β̂_j ± t_{.00625,27} s√g_jj
       −0.2145      0.1573            −0.2711      0.2139
        0.0769      0.3548             0.0346      0.3970
      −10.1698      1.5297           −11.9500      3.3099
        3.2859     14.6639             1.5546     16.3952
8.38 (a) 
Source df SS MS F p Value 


Due to B; 3 13266.8574  4422.2858 65.037 3.112x10 %° 
Error 30 2039.9062 67.9969 
Total 33  15306.7636 


The F value can also be found using (8.23): 


R 8667/3 


BS G=9iG@ek=t, Gseens0 




(b) The values of t; = ic} / 5\/%;j in (8.39) are given in the following table: 


Variable ic} Bij tj p Value 
x 0.7803 0.0810 9.631 1.09x107'° 
X 0.5031 0.1251 4.020 0.000361 

x3 —17.1002 13.5954 —1.258 0.218 


Comparing each (two-sided) p value to .05, we would reject Ho: B, = 0 
for 6, and B5. Comparing each p value to the Bonferroni value of 
.05/3=.0167, we reject Hp for B, and B, also. 

(c) For v = 30, we have to95 39 = 2.0423 and to0g33,30 = 2.5357. Using 
(8.47) and (8.67) and the values in the answer to part (b), we obtain 
the following lower and upper confidence limits: 


6 = tos 8\/8ij B + £00833 5\/8i 
0.6148 0.9457 0.5748 0.9857 
0.2475 0.7587 0.1858 0.8204 


— 44.8656 10.6652 —51.5745 17.3740 


(d) Using (8.52), we have

\[ x_0'\hat\beta \pm t_{\alpha/2, n-k-1}\, s\sqrt{x_0'(X'X)^{-1}x_0} \]

18.9103 ± 2.0423(8.2460)√.1615
18.9103 ± 6.7677,
(12.1426, 25.6780)

(e) Using (8.61), we have

\[ x_0'\hat\beta \pm t_{\alpha/2, n-k-1}\, s\sqrt{1 + x_0'(X'X)^{-1}x_0} \]

18.9103 ± 2.0423(8.2460)√1.1615
18.9103 ± 18.1496,
(.7609, 37.0599)
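
The only difference between the interval for E(y0) in part (d) and the prediction interval in part (e) is the extra 1 under the square root. The sketch below recomputes both half-widths from the values printed above (the critical value, s, the fitted value, and the quantity x0'(X'X)^{-1}x0 = .1615); the variable names are illustrative.

```python
import math

# Values reported in parts (d) and (e).
t_crit = 2.0423      # t_{.025, 30}
s = 8.2460           # residual standard deviation
y0_hat = 18.9103     # fitted value x0' beta_hat
leverage = 0.1615    # x0'(X'X)^{-1} x0

# Half-width of the confidence interval for E(y0), as in (8.52).
half_ci = t_crit * s * math.sqrt(leverage)

# Half-width of the prediction interval for a new y0, as in (8.61).
half_pi = t_crit * s * math.sqrt(1 + leverage)

ci = (y0_hat - half_ci, y0_hat + half_ci)
pi = (y0_hat - half_pi, y0_hat + half_pi)
```

The recomputed limits agree with the printed ones to about three decimal places; small differences in the last digit reflect rounding of the inputs.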


8.39 (a) $x_0'\hat\beta \pm t_{.025,15}\, s\sqrt{x_0'(X'X)^{-1}x_0}$

55.2603 ± (2.1314)(4.0781)√.19975
55.2603 ± 3.8849,
(51.3754, 59.1451)


8.40 




(b) $x_0'\hat\beta \pm t_{.025,15}\, s\sqrt{1 + x_0'(X'X)^{-1}x_0}$

55.2603 ± (2.1314)(4.0781)√1.19975
55.2603 ± 9.5205,
(45.7394, 64.7811)


i 0 1-1 O : A 1116 
(c) Using e=(? 0 2 a) we obtain er 


003366 —.006943 


/ —l~w _ 
a aa fone 044974 


), F=.1577, p=.856. 


(a) BX'y — ny? — (BX! y — nf?) = 1741.1233 — 1707.1580, 


5.6609 
= =11006 ra a 
5.1343 O20) Bee 
9741 — 9551/6 
= = 1.102 
1 — 9741/9 ee 
() Bo: 332.1110 + 39.8430 
(292.2679, 371.9540), 
By: —1.5460 + .21109 
(—1.7571, —1.3349), 
By: 1.4246 + 3147 
(1.7393, — 1.1098), 
By: —2.2374 + .7243 
(2.9617, — 1.5130) 
(d) By: —1.5460 + .2668 
(—1.8127, —1.2792), 
By: —1.4246 + 3977 
(— 1.8223, — 1.0268), 
By: —2.2347 + 9154 
(—3.1528, —1.3220) 
(e) 20.2547 + 2.2024 
(18.0524, 22.4571) 
(6 20.2547 + 5.3975 


(14.8573, 25.6522) 


Chapter 9 


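9.1 (a) By (9.5), we obtain

\[ E(\hat{\boldsymbol\varepsilon}) = E[(\mathbf{I} - \mathbf{H})\mathbf{y}] = (\mathbf{I} - \mathbf{H})E(\mathbf{y}) = [\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']\mathbf{X}\boldsymbol\beta = \mathbf{X}\boldsymbol\beta - \mathbf{X}\boldsymbol\beta = \mathbf{0}. \]

(b) We first note that $\mathbf{I} - \mathbf{H}$ is symmetric and idempotent [see Theorem
2.13e(i)]. Then by Theorem 3.6d(i), we obtain

\[ \mathrm{cov}(\hat{\boldsymbol\varepsilon}) = \mathrm{cov}[(\mathbf{I} - \mathbf{H})\mathbf{y}] = (\mathbf{I} - \mathbf{H})(\sigma^2\mathbf{I})(\mathbf{I} - \mathbf{H})' = \sigma^2(\mathbf{I} - \mathbf{H})^2 = \sigma^2(\mathbf{I} - \mathbf{H}). \]

(c) By Theorem 3.6d(ii), we have

\[ \mathrm{cov}(\hat{\boldsymbol\varepsilon}, \mathbf{y}) = \mathrm{cov}[(\mathbf{I} - \mathbf{H})\mathbf{y}, \mathbf{I}\mathbf{y}] = (\mathbf{I} - \mathbf{H})(\sigma^2\mathbf{I})\mathbf{I} = \sigma^2(\mathbf{I} - \mathbf{H}). \]

(d) $\mathrm{cov}(\hat{\boldsymbol\varepsilon}, \hat{\mathbf{y}}) = \mathrm{cov}[(\mathbf{I} - \mathbf{H})\mathbf{y}, \mathbf{H}\mathbf{y}] = (\mathbf{I} - \mathbf{H})(\sigma^2\mathbf{I})\mathbf{H} = \sigma^2(\mathbf{H} - \mathbf{H}^2) = \sigma^2(\mathbf{H} - \mathbf{H}) = \mathbf{O}$.

(e) $\bar{\hat\varepsilon} = \sum_{i=1}^n \hat\varepsilon_i/n = \hat{\boldsymbol\varepsilon}'\mathbf{j}/n$. By (9.4) and (9.5), $\hat{\boldsymbol\varepsilon}'\mathbf{j} = \mathbf{y}'(\mathbf{I} - \mathbf{H})\mathbf{j} = \mathbf{y}'(\mathbf{j} - \mathbf{j}) = 0$.

(f) By (9.5), $\hat{\boldsymbol\varepsilon}'\mathbf{y} = \mathbf{y}'(\mathbf{I} - \mathbf{H})\mathbf{y}$.

(g) By (9.2) and (9.5), $\hat{\boldsymbol\varepsilon}'\hat{\mathbf{y}} = \mathbf{y}'(\mathbf{I} - \mathbf{H})\mathbf{H}\mathbf{y} = \mathbf{y}'(\mathbf{H} - \mathbf{H}^2)\mathbf{y} = \mathbf{y}'(\mathbf{H} - \mathbf{H})\mathbf{y} = 0$.

(h) By (9.3) and (9.5), $\hat{\boldsymbol\varepsilon}'\mathbf{X} = \mathbf{y}'(\mathbf{I} - \mathbf{H})\mathbf{X} = \mathbf{y}'(\mathbf{X} - \mathbf{H}\mathbf{X}) = \mathbf{y}'(\mathbf{X} - \mathbf{X}) = \mathbf{0}'$.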
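Several of the identities in 9.1 are easy to confirm numerically. The following sketch builds the hat matrix for a small simple linear regression in plain Python and checks that $\mathbf{I} - \mathbf{H}$ is idempotent, that the residuals are orthogonal to $\mathbf{X}$ and to $\hat{\mathbf{y}}$, and that tr($\mathbf{H}$) equals the number of columns of $\mathbf{X}$. The data and helper names are made up for illustration.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def inv2(m):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

# made-up data: intercept column plus one predictor
x_vals = [1.0, 2.0, 4.0, 7.0, 11.0]
X = [[1.0, x] for x in x_vals]
y = [[2.1], [2.9], [5.2], [7.8], [12.5]]
n = len(X)

Xt = transpose(X)
H = matmul(matmul(X, inv2(matmul(Xt, X))), Xt)          # H = X (X'X)^{-1} X'
I_minus_H = [[(1.0 if i == j else 0.0) - H[i][j] for j in range(n)] for i in range(n)]

resid = matmul(I_minus_H, y)                             # residual vector (I - H) y
y_hat = matmul(H, y)                                     # fitted values H y

trace_H = sum(H[i][i] for i in range(n))
squared = matmul(I_minus_H, I_minus_H)
idem_err = max(abs(p - q) for rp, rq in zip(squared, I_minus_H) for p, q in zip(rp, rq))
resid_X = matmul(transpose(resid), X)[0]                 # should be (0, 0)
resid_yhat = matmul(transpose(resid), y_hat)[0][0]       # should be 0

print(trace_H, idem_err, resid_X, resid_yhat)
```

Because the first column of `X` is the vector of ones, the first entry of `resid_X` is also the sum of the residuals, which is the statement in part (e).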


92 (a) gq i ge ho oe Payee I 
—(h—-h’)=1-2h=0, h= ie 
an’ . ae: (5) 


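(b) Let $\mathbf{X}_c = \mathbf{A}$ and $(\mathbf{X}_c'\mathbf{X}_c)^{-1} = \mathbf{B}$. Then

\[
\mathbf{A}\mathbf{B}\mathbf{A}' =
\begin{pmatrix} \mathbf{a}_1' \\ \mathbf{a}_2' \\ \vdots \\ \mathbf{a}_n' \end{pmatrix}
\mathbf{B}(\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_n)
=
\begin{pmatrix} \mathbf{a}_1' \\ \mathbf{a}_2' \\ \vdots \\ \mathbf{a}_n' \end{pmatrix}
(\mathbf{B}\mathbf{a}_1, \mathbf{B}\mathbf{a}_2, \ldots, \mathbf{B}\mathbf{a}_n)
=
\begin{pmatrix}
\mathbf{a}_1'\mathbf{B}\mathbf{a}_1 & \mathbf{a}_1'\mathbf{B}\mathbf{a}_2 & \cdots & \mathbf{a}_1'\mathbf{B}\mathbf{a}_n \\
\mathbf{a}_2'\mathbf{B}\mathbf{a}_1 & \mathbf{a}_2'\mathbf{B}\mathbf{a}_2 & \cdots & \mathbf{a}_2'\mathbf{B}\mathbf{a}_n \\
\vdots & \vdots & & \vdots \\
\mathbf{a}_n'\mathbf{B}\mathbf{a}_1 & \mathbf{a}_n'\mathbf{B}\mathbf{a}_2 & \cdots & \mathbf{a}_n'\mathbf{B}\mathbf{a}_n
\end{pmatrix}.
\]

(c) $\mathrm{tr}(\mathbf{H}) = \mathrm{tr}[\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'] = \mathrm{tr}[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}] = \mathrm{tr}(\mathbf{I}_{k+1}) = k + 1$.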


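9.3 By Theorem 9.2(iii), $h_{ii} = (1/n) + (\mathbf{x}_{1i} - \bar{\mathbf{x}}_1)'(\mathbf{X}_c'\mathbf{X}_c)^{-1}(\mathbf{x}_{1i} - \bar{\mathbf{x}}_1)$.
By (2.101) and (2.104), this can be written as

\[
h_{ii} = \frac{1}{n} + (\mathbf{x}_{1i} - \bar{\mathbf{x}}_1)'\left(\sum_{r=1}^{k}\frac{1}{\lambda_r}\mathbf{a}_r\mathbf{a}_r'\right)(\mathbf{x}_{1i} - \bar{\mathbf{x}}_1)
= \frac{1}{n} + \sum_{r=1}^{k}\frac{1}{\lambda_r}[(\mathbf{x}_{1i} - \bar{\mathbf{x}}_1)'\mathbf{a}_r][\mathbf{a}_r'(\mathbf{x}_{1i} - \bar{\mathbf{x}}_1)]
= \frac{1}{n} + \sum_{r=1}^{k}\frac{1}{\lambda_r}[(\mathbf{x}_{1i} - \bar{\mathbf{x}}_1)'\mathbf{a}_r]^2,
\]

where $\lambda_r$ is the rth eigenvalue of $\mathbf{X}_c'\mathbf{X}_c$ and $\mathbf{a}_r$ is the corresponding (normalized)
eigenvector of $\mathbf{X}_c'\mathbf{X}_c$. By (2.81), the cosine of the angle $\theta_{ir}$ between $\mathbf{x}_{1i} - \bar{\mathbf{x}}_1$
and $\mathbf{a}_r$ is

\[
\cos\theta_{ir} = \frac{(\mathbf{x}_{1i} - \bar{\mathbf{x}}_1)'\mathbf{a}_r}{\sqrt{[(\mathbf{x}_{1i} - \bar{\mathbf{x}}_1)'(\mathbf{x}_{1i} - \bar{\mathbf{x}}_1)](\mathbf{a}_r'\mathbf{a}_r)}}
= \frac{(\mathbf{x}_{1i} - \bar{\mathbf{x}}_1)'\mathbf{a}_r}{\sqrt{(\mathbf{x}_{1i} - \bar{\mathbf{x}}_1)'(\mathbf{x}_{1i} - \bar{\mathbf{x}}_1)}},
\]

since $\mathbf{a}_r'\mathbf{a}_r = 1$. Thus, if we multiply and divide by $(\mathbf{x}_{1i} - \bar{\mathbf{x}}_1)'(\mathbf{x}_{1i} - \bar{\mathbf{x}}_1)$, we can
express $h_{ii}$ as

\[
h_{ii} = \frac{1}{n} + (\mathbf{x}_{1i} - \bar{\mathbf{x}}_1)'(\mathbf{x}_{1i} - \bar{\mathbf{x}}_1)\sum_{r=1}^{k}\frac{1}{\lambda_r}\frac{[(\mathbf{x}_{1i} - \bar{\mathbf{x}}_1)'\mathbf{a}_r]^2}{(\mathbf{x}_{1i} - \bar{\mathbf{x}}_1)'(\mathbf{x}_{1i} - \bar{\mathbf{x}}_1)}
= \frac{1}{n} + (\mathbf{x}_{1i} - \bar{\mathbf{x}}_1)'(\mathbf{x}_{1i} - \bar{\mathbf{x}}_1)\sum_{r=1}^{k}\frac{1}{\lambda_r}\cos^2\theta_{ir}.
\]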


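9.4 (a) Using (2.51), we obtain

\[
\mathbf{H}^* = (\mathbf{X}, \mathbf{y})
\begin{pmatrix}
(\mathbf{X}'\mathbf{X})^{-1} + \dfrac{(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}\mathbf{y}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}}{b} & -\dfrac{(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}}{b} \\[2ex]
-\dfrac{\mathbf{y}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}}{b} & \dfrac{1}{b}
\end{pmatrix}
\begin{pmatrix} \mathbf{X}' \\ \mathbf{y}' \end{pmatrix},
\]

where $b = \mathbf{y}'\mathbf{y} - \mathbf{y}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$. Show that $b = \mathbf{y}'(\mathbf{I} - \mathbf{H})\mathbf{y} = \hat{\boldsymbol\varepsilon}'\hat{\boldsymbol\varepsilon}$. Show
that

\[
\mathbf{H}^* = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' + \frac{\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}\mathbf{y}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'}{b}
- \frac{\mathbf{y}\mathbf{y}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'}{b} - \frac{\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}\mathbf{y}'}{b} + \frac{\mathbf{y}\mathbf{y}'}{b}
= \mathbf{H} + \frac{1}{b}(\mathbf{H}\mathbf{y}\mathbf{y}'\mathbf{H} - \mathbf{y}\mathbf{y}'\mathbf{H} - \mathbf{H}\mathbf{y}\mathbf{y}' + \mathbf{y}\mathbf{y}').
\]

(b)
\[
\mathbf{H}^* = \mathbf{H} + \frac{1}{b}[(\mathbf{H}\mathbf{y}\mathbf{y}' - \mathbf{y}\mathbf{y}')\mathbf{H} + \mathbf{y}\mathbf{y}' - \mathbf{H}\mathbf{y}\mathbf{y}']
= \mathbf{H} + \frac{1}{b}[(\mathbf{y}\mathbf{y}' - \mathbf{H}\mathbf{y}\mathbf{y}')(\mathbf{I} - \mathbf{H})]
= \mathbf{H} + \frac{1}{b}(\mathbf{I} - \mathbf{H})\mathbf{y}\mathbf{y}'(\mathbf{I} - \mathbf{H})
= \mathbf{H} + \frac{\hat{\boldsymbol\varepsilon}\hat{\boldsymbol\varepsilon}'}{\hat{\boldsymbol\varepsilon}'\hat{\boldsymbol\varepsilon}} \quad [\text{by (9.5)}].
\]

By (2.21), the diagonal elements of $\hat{\boldsymbol\varepsilon}\hat{\boldsymbol\varepsilon}'$ are $\hat\varepsilon_1^2, \hat\varepsilon_2^2, \ldots, \hat\varepsilon_n^2$. Therefore,
$h_{ii}^* = h_{ii} + \hat\varepsilon_i^2/\hat{\boldsymbol\varepsilon}'\hat{\boldsymbol\varepsilon}$.

(c) Since $\mathbf{H}^*$ is a hat matrix, we have by Theorem 9.2(i), $1/n \le h_{ii}^* \le 1$.
Therefore, $(1/n) \le h_{ii} + \hat\varepsilon_i^2/\hat{\boldsymbol\varepsilon}'\hat{\boldsymbol\varepsilon} \le 1$.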


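9.5 (a)
\[
\mathbf{X}'\mathbf{X} = (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n)\begin{pmatrix} \mathbf{x}_1' \\ \mathbf{x}_2' \\ \vdots \\ \mathbf{x}_n' \end{pmatrix}
= \sum_{j=1}^{n}\mathbf{x}_j\mathbf{x}_j' = \sum_{j \ne i}\mathbf{x}_j\mathbf{x}_j' + \mathbf{x}_i\mathbf{x}_i' = \mathbf{X}_{(i)}'\mathbf{X}_{(i)} + \mathbf{x}_i\mathbf{x}_i',
\]
\[
\mathbf{X}'\mathbf{y} = (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n)\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}
= \sum_{j=1}^{n}\mathbf{x}_j y_j = \sum_{j \ne i}\mathbf{x}_j y_j + \mathbf{x}_i y_i = \mathbf{X}_{(i)}'\mathbf{y}_{(i)} + \mathbf{x}_i y_i.
\]

(b) $\hat{\boldsymbol\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} = (\mathbf{X}'\mathbf{X})^{-1}(\mathbf{X}_{(i)}'\mathbf{y}_{(i)} + \mathbf{x}_i y_i) = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}_{(i)}'\mathbf{y}_{(i)} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i y_i$.

(c) From $\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$, we have $h_{ii} = \mathbf{x}_i'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i$, where $\mathbf{x}_i'$ is the ith
row of $\mathbf{X}$. Then using the result of part (a) and the inverse in the statement
of the problem, we obtain

\[
\hat{\boldsymbol\beta}_{(i)} = (\mathbf{X}_{(i)}'\mathbf{X}_{(i)})^{-1}\mathbf{X}_{(i)}'\mathbf{y}_{(i)}
= (\mathbf{X}'\mathbf{X} - \mathbf{x}_i\mathbf{x}_i')^{-1}\mathbf{X}_{(i)}'\mathbf{y}_{(i)}
= \left[(\mathbf{X}'\mathbf{X})^{-1} + \frac{(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i\mathbf{x}_i'(\mathbf{X}'\mathbf{X})^{-1}}{1 - h_{ii}}\right]\mathbf{X}_{(i)}'\mathbf{y}_{(i)}.
\]

(d) From parts (b) and (c), we have

\[
\hat{\boldsymbol\beta}_{(i)} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}_{(i)}'\mathbf{y}_{(i)} + \frac{(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i\mathbf{x}_i'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}_{(i)}'\mathbf{y}_{(i)}}{1 - h_{ii}}
= \hat{\boldsymbol\beta} - (\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i y_i + \frac{(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i\mathbf{x}_i'[\hat{\boldsymbol\beta} - (\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i y_i]}{1 - h_{ii}}.
\]

With $\mathbf{x}_i'\hat{\boldsymbol\beta} = \hat{y}_i$ and $\mathbf{x}_i'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i = h_{ii}$, we have

\[
\hat{\boldsymbol\beta}_{(i)} = \hat{\boldsymbol\beta} - (\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i y_i + \frac{(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i\hat{y}_i}{1 - h_{ii}} - \frac{(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i h_{ii} y_i}{1 - h_{ii}}
= \hat{\boldsymbol\beta} - \frac{y_i - \hat{y}_i}{1 - h_{ii}}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i
= \hat{\boldsymbol\beta} - \frac{\hat\varepsilon_i}{1 - h_{ii}}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i.
\]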


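9.6 By (9.27) and (9.29), we obtain

\[
\hat\varepsilon_{(i)} = y_i - \mathbf{x}_i'\hat{\boldsymbol\beta}_{(i)}
= y_i - \mathbf{x}_i'\left[\hat{\boldsymbol\beta} - \frac{\hat\varepsilon_i}{1 - h_{ii}}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i\right]
= y_i - \mathbf{x}_i'\hat{\boldsymbol\beta} + \frac{\hat\varepsilon_i\,\mathbf{x}_i'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i}{1 - h_{ii}}
= \hat\varepsilon_i + \frac{\hat\varepsilon_i h_{ii}}{1 - h_{ii}}
= \frac{\hat\varepsilon_i}{1 - h_{ii}}.
\]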
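The identity in 9.6 says that a deleted residual can be obtained from the ordinary residual and the leverage, without refitting. A minimal sketch for simple linear regression compares that shortcut against a brute-force refit with observation i removed. The data are made up, and the leverage formula used is the k = 1 case of the expression for $h_{ii}$ quoted in 9.3.

```python
def fit_line(xs, ys):
    """Least squares fit of y = b0 + b1*x; also returns xbar and Sxx."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    return ybar - b1 * xbar, b1, xbar, sxx

# made-up data
xs = [1.0, 2.0, 4.0, 7.0, 11.0]
ys = [2.1, 2.9, 5.2, 7.8, 12.5]
n = len(xs)
b0, b1, xbar, sxx = fit_line(xs, ys)

shortcut, refit = [], []
for i in range(n):
    e_i = ys[i] - (b0 + b1 * xs[i])               # ordinary residual
    h_ii = 1.0 / n + (xs[i] - xbar) ** 2 / sxx    # leverage when k = 1
    shortcut.append(e_i / (1.0 - h_ii))           # deleted residual via 9.6
    # brute force: drop observation i, refit, predict the dropped point
    c0, c1, _, _ = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
    refit.append(ys[i] - (c0 + c1 * xs[i]))

press = sum(e ** 2 for e in shortcut)             # sum of squared deleted residuals
print(shortcut, refit, press)
```

The two lists agree to rounding error, which is why PRESS can be computed from a single fit.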
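9.7 (a) Assuming that $\mathbf{X}$ is fixed (constant), we have

\[
\mathrm{var}(\hat\varepsilon_{(i)}) = \mathrm{var}\left(\frac{\hat\varepsilon_i}{1 - h_{ii}}\right) = \frac{\mathrm{var}(\hat\varepsilon_i)}{(1 - h_{ii})^2} = \frac{\sigma^2(1 - h_{ii})}{(1 - h_{ii})^2} = \frac{\sigma^2}{1 - h_{ii}}.
\]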

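9.8 (a) $\mathbf{y}_{(i)}'\mathbf{y}_{(i)} = \sum_{j \ne i} y_j^2 = \sum_{j=1}^{n} y_j^2 - y_i^2 = \mathbf{y}'\mathbf{y} - y_i^2$.

(b)
\[
\mathbf{y}_{(i)}'\mathbf{X}_{(i)}\hat{\boldsymbol\beta}_{(i)} = (\mathbf{y}'\mathbf{X} - y_i\mathbf{x}_i')\left[\hat{\boldsymbol\beta} - \frac{\hat\varepsilon_i}{1 - h_{ii}}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i\right]
= \mathbf{y}'\mathbf{X}\hat{\boldsymbol\beta} - y_i\hat{y}_i - \frac{\hat\varepsilon_i}{1 - h_{ii}}(\hat{y}_i - y_i h_{ii}).
\]

Substituting $\hat{y}_i = y_i - \hat\varepsilon_i$, this becomes

\[
\mathbf{y}_{(i)}'\mathbf{X}_{(i)}\hat{\boldsymbol\beta}_{(i)} = \mathbf{y}'\mathbf{X}\hat{\boldsymbol\beta} + \frac{-y_i(y_i - \hat\varepsilon_i)(1 - h_{ii}) - \hat\varepsilon_i(y_i - \hat\varepsilon_i) + \hat\varepsilon_i y_i h_{ii}}{1 - h_{ii}}
= \mathbf{y}'\mathbf{X}\hat{\boldsymbol\beta} + \frac{-(1 - h_{ii})y_i^2 + \hat\varepsilon_i^2}{1 - h_{ii}}
= \mathbf{y}'\mathbf{X}\hat{\boldsymbol\beta} - y_i^2 + \frac{\hat\varepsilon_i^2}{1 - h_{ii}}.
\]

(c)
\[
\mathrm{SSE}_{(i)} = \mathbf{y}'\mathbf{y} - y_i^2 - \left(\mathbf{y}'\mathbf{X}\hat{\boldsymbol\beta} - y_i^2 + \frac{\hat\varepsilon_i^2}{1 - h_{ii}}\right)
= \mathbf{y}'\mathbf{y} - \mathbf{y}'\mathbf{X}\hat{\boldsymbol\beta} - \frac{\hat\varepsilon_i^2}{1 - h_{ii}}.
\]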

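9.9 Substituting (9.29) into (9.35) gives

\[
D_i = \frac{\hat\varepsilon_i^2\,\mathbf{x}_i'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i}{(1 - h_{ii})^2(k + 1)s^2}
= \frac{\hat\varepsilon_i^2\,h_{ii}}{(1 - h_{ii})^2(k + 1)s^2}.
\]

By (9.25), this becomes

\[
D_i = \frac{r_i^2}{k + 1}\,\frac{h_{ii}}{1 - h_{ii}}.
\]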
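The closed form for $D_i$ in 9.9 can be checked against printed table entries. The sketch below uses the two largest Cook's distances in the Land Rent table of 9.11 (observations 5 and 32) and takes n = 34 and k = 3 from the ANOVA degrees of freedom in 8.38(a). The conversion from $r_i$ to $t_i$ is the standard identity between internally and externally studentized residuals; it is not quoted in this section, so treat it as an assumption here.

```python
import math

n, k = 34, 3  # taken from the df column in 8.38(a): 3 regression df, 33 total df

# (r_i, h_ii, printed t_i, printed D_i) for observations 5 and 32 of the 9.11 table
rows = [(-2.107, 0.186, -2.245, 0.253), (-2.234, 0.217, -2.406, 0.346)]

def cooks_distance(r, h, k):
    """D_i = r_i^2/(k+1) * h_ii/(1 - h_ii), the last line of 9.9."""
    return r ** 2 / (k + 1) * h / (1 - h)

def externally_studentized(r, n, k):
    """t_i from r_i (standard identity, assumed rather than quoted from the text)."""
    return r * math.sqrt((n - k - 2) / (n - k - 1 - r ** 2))

computed = [(externally_studentized(r, n, k), cooks_distance(r, h, k))
            for r, h, _, _ in rows]
for (t_calc, d_calc), (_, _, t_print, d_print) in zip(computed, rows):
    print(round(t_calc, 3), t_print, round(d_calc, 3), d_print)
```

Agreement is only to about three decimals because the inputs are the rounded table values.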


9.10

Observation   $y_i$   $\hat{y}_i$   $\hat\varepsilon_i$   $h_{ii}$   $r_i$     $t_i$     $D_i$
  1           29      27.86          1.139     .197      0.466     0.459    .011
  2           24      23.76          0.236     .219      0.098     0.096    .001
  3           26      25.88          0.120     .179      0.049     0.048    .000
  4           22      23.96         -1.961     .289     -0.852    -0.848    .059
  5           27      28.42         -1.419     .128     -0.557    -0.550    .009
  6           21      21.67         -0.672     .121     -0.262    -0.258    .002
  7           33      31.78          1.222     .053      0.460     0.453    .002
  8           34      34.22         -0.218     .042     -0.082    -0.080    .000
  9           32      31.98          0.017     .055      0.006     0.006    .000
 10           34      33.33          0.666     .039      0.249     0.244    .000
 11           20      21.54         -1.544     .124     -0.604    -0.597    .010
 12           36      32.15          3.846     .040      1.438     1.468    .017
 13           34      33.73          0.271     .072      0.103     0.101    .000
 14           23      23.98         -0.982     .191     -0.400    -0.394    .008
 15           24      19.71          4.287     .418      2.058     2.200    .609
 16           32      32.84         -0.841     .060     -0.318    -0.312    .001
 17           40      40.76         -0.762     .285     -0.330    -0.324    .009
 18           46      44.39          1.614     .493      0.831     0.826    .134
 19           55      52.92          2.083     .243      0.877     0.873    .049
 20           52      52.02         -0.018     .224     -0.007    -0.007    .000
 21           29      32.38         -3.377     .177     -1.364    -1.387    .080
 22           22      23.15         -1.155     .169     -0.464    -0.457    .009
 23           31      36.59         -5.586     .227     -2.328    -2.555    .319
 24           45      47.91         -2.909     .185     -1.180    -1.190    .063
 25           37      32.61          4.391     .087      1.683     1.746    .054
 26           37      31.89          5.106     .109      1.981     2.103    .096
 27           33      30.22          2.775     .124      1.086     1.090    .033
 28           27      31.59         -4.593     .102     -1.775    -1.854    .071
 29           34      34.40         -0.399     .068     -0.151    -0.149    .000
 30           19      19.32         -0.324     .091     -0.124    -0.122    .000
 31           16      19.62         -3.623     .102     -1.400    -1.427    .044
 32           22      19.39          2.607     .086      0.999     0.999    .019

PRESS = 310.443, SSE = 201.228.


9.11 Residuals and Influence Measures for the Land Rent Data of Table 7.5 (a)

Observation   $y_i$    $\hat{y}_i$   $\hat\varepsilon_i$   $h_{ii}$   $r_i$     $t_i$     $D_i$
  1           18.38    17.332        1.048     0.080     0.132     0.130    .000
  2           20.00    23.948       -3.948     0.062    -0.494    -0.488    .004
  3           11.50    13.855       -2.355     0.141    -0.308    -0.303    .004
  4           25.00    26.242       -1.242     0.070    -0.156    -0.154    .000
  5           52.50    68.180      -15.680     0.186    -2.107    -2.245    .253
  6           82.50    66.431       16.069     0.083     2.035     2.156    .094
  7           25.00    32.920       -7.920     0.067    -0.994    -0.994    .018
  8           30.67    32.642       -1.972     0.068    -0.248    -0.244    .001
  9           12.00     7.715        4.285     0.187     0.576     0.570    .019
 10           61.25    57.481        3.769     0.103     0.483     0.476    .007
 11           60.00    50.208        9.792     0.058     1.224     1.234    .023
 12           57.50    68.846      -11.346     0.100    -1.451    -1.479    .059
 13           31.00    31.768       -0.768     0.076    -0.097    -0.095    .000
 14           60.00    61.864       -1.864     0.067    -0.234    -0.230    .001
 15           72.50    66.773        5.727     0.109     0.736     0.730    .017
 16           60.33    66.702       -6.372     0.168    -0.847    -0.843    .036
 17           49.75    59.663       -9.913     0.114    -1.278    -1.292    .053
 18            8.50    10.790       -2.290     0.192    -0.309    -0.304    .006
 19           36.50    24.643       11.857     0.068     1.489     1.522    .040
 20           60.00    65.606       -5.606     0.181    -0.751    -0.746    .031
 21           16.25    18.016       -1.766     0.505    -0.304    -0.300    .024
 22           50.00    47.424        2.576     0.035     0.318     0.313    .001
 23           11.50    19.366       -4.866     0.118    -0.628    -0.622    .013
 24           35.00    38.577       -3.577     0.064    -0.448    -0.442    .003
 25           75.00    61.694       13.306     0.063     1.667     1.702    .047
 26           31.56    35.257       -3.697     0.035    -0.456    -0.450    .002
 27           48.50    24.200        6.300     0.063     0.789     0.784    .010
 28           77.50    69.889        7.611     0.242     1.060     1.062    .089
 29           21.67    22.063       -0.393     0.060    -0.049    -0.048    .000
 30           19.75    21.221       -1.471     0.096    -0.188    -0.185    .001
 31           56.00    48.174        7.826     0.051     0.974     0.974    .013
 32           25.00    41.300      -16.300     0.217    -2.234    -2.406    .346
 33           40.00    26.907       13.093     0.214     1.791     1.864    .219
 34           56.67    56.585        0.085     0.060     0.011     0.010    .000

(a) PRESS = 2751.18, SSE = 2039.91.
9.12 Residuals and Influence Measures for the Chemical Data with Dependent
Variable $y_2$ (a)

Observation   $y_i$   $\hat{y}_i$   $\hat\varepsilon_i$   $h_{ii}$   $r_i$     $t_i$     $D_i$
  1           45.9    49.34         -3.442     0.430    -1.118    -1.128    .235
  2           53.3    54.51         -1.211     0.310    -0.358    -0.347    .014
  3           57.5    53.46          4.039     0.155     1.078     1.084    .053
  4           58.8    56.56          2.238     0.139     0.592     0.578    .014
  5           60.6    56.04          4.559     0.129     1.198     1.271    .053
  6           58.0    59.14         -1.143     0.140    -0.302    -0.293    .004
  7           58.6    57.51          1.094     0.228     0.305     0.296    .007
  8           52.4    60.61         -8.208     0.186    -2.231    -2.638    .258
  9           56.9    56.30          0.598     0.053     0.151     0.146    .000
 10           55.4    60.35         -4.947     0.233    -1.385    -1.433    .146
 11           46.9    52.26         -5.356     0.240    -1.507    -1.580    .179
 12           57.3    57.77         -0.467     0.164    -0.125    -0.121    .001
 13           55.0    54.84          0.163     0.146     0.043     0.042    .000
 14           58.9    59.40         -0.503     0.245    -0.142    -0.137    .002
 15           50.3    53.20         -2.900     0.250    -0.821    -0.812    .056
 16           61.1    58.15          2.950     0.258     0.840     0.831    .061
 17           62.9    58.15          4.750     0.258     1.352     1.394    .159
 18           60.0    56.41          3.592     0.217     0.996     0.955    .069
 19           60.6    56.41          4.192     0.217     1.162     1.177    .094

(a) PRESS = 416.039, SSE = 249.462.
Chapter 10 


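10.1 Since $(\mathbf{v}_i - \bar{\mathbf{v}})' = (y_i - \bar{y}, x_{i1} - \bar{x}_1, \ldots, x_{ik} - \bar{x}_k)$, the element in the (1, 1)
position of $(\mathbf{v}_i - \bar{\mathbf{v}})(\mathbf{v}_i - \bar{\mathbf{v}})'$ is $(y_i - \bar{y})^2$. When this is summed over i as
in (10.13), we have $\sum_{i=1}^{n}(y_i - \bar{y})^2 = (n - 1)s_{yy}$ as in (10.14). Similarly,
the (1, 2) element of $(\mathbf{v}_i - \bar{\mathbf{v}})(\mathbf{v}_i - \bar{\mathbf{v}})'$ is $(y_i - \bar{y})(x_{i1} - \bar{x}_1)$, which sums to
$(n - 1)s_{y1}$, and the (2, 3) element of $(\mathbf{v}_i - \bar{\mathbf{v}})(\mathbf{v}_i - \bar{\mathbf{v}})'$ is
$(x_{i1} - \bar{x}_1)(x_{i2} - \bar{x}_2)$, which sums to $(n - 1)s_{12}$.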


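10.2 By a note following Theorem 7.6b, $\hat{\boldsymbol\mu}$ and $\mathbf{S}$ are jointly sufficient for $\boldsymbol\mu$ and
$\boldsymbol\Sigma$ if the likelihood function (joint density) in (10.11) factors as
$L(\boldsymbol\mu, \boldsymbol\Sigma) = g(\hat{\boldsymbol\mu}, \mathbf{S}, \boldsymbol\mu, \boldsymbol\Sigma)\,h(\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_n)$, where $\mathbf{v}_i = (y_i, \mathbf{x}_i')'$, as in the
proof of Theorem 10.2a. Noting that a scalar is equal to its trace, we write
the exponent in (10.11) in the form

\[
\sum_{i=1}^{n}(\mathbf{v}_i - \boldsymbol\mu)'\boldsymbol\Sigma^{-1}(\mathbf{v}_i - \boldsymbol\mu)
= \sum_{i=1}^{n}\mathrm{tr}[(\mathbf{v}_i - \boldsymbol\mu)'\boldsymbol\Sigma^{-1}(\mathbf{v}_i - \boldsymbol\mu)]
= \mathrm{tr}\left[\boldsymbol\Sigma^{-1}\sum_{i=1}^{n}(\mathbf{v}_i - \boldsymbol\mu)(\mathbf{v}_i - \boldsymbol\mu)'\right].
\]

Adding and subtracting $\bar{\mathbf{v}}$, the sum becomes

\[
\sum_{i=1}^{n}(\mathbf{v}_i - \boldsymbol\mu)(\mathbf{v}_i - \boldsymbol\mu)'
= \sum_{i=1}^{n}(\mathbf{v}_i - \bar{\mathbf{v}} + \bar{\mathbf{v}} - \boldsymbol\mu)(\mathbf{v}_i - \bar{\mathbf{v}} + \bar{\mathbf{v}} - \boldsymbol\mu)'
= \sum_{i=1}^{n}(\mathbf{v}_i - \bar{\mathbf{v}})(\mathbf{v}_i - \bar{\mathbf{v}})' + n(\bar{\mathbf{v}} - \boldsymbol\mu)(\bar{\mathbf{v}} - \boldsymbol\mu)'
= (n - 1)\mathbf{S} + n(\bar{\mathbf{v}} - \boldsymbol\mu)(\bar{\mathbf{v}} - \boldsymbol\mu)'.
\]

Show that the other two terms vanish. Then show that $L(\boldsymbol\mu, \boldsymbol\Sigma)$ can be written
as

\[
L(\boldsymbol\mu, \boldsymbol\Sigma) = \frac{1}{(\sqrt{2\pi})^{n(k+1)}|\boldsymbol\Sigma|^{n/2}}\,
e^{-[(n-1)\mathrm{tr}(\boldsymbol\Sigma^{-1}\mathbf{S}) + n(\bar{\mathbf{v}} - \boldsymbol\mu)'\boldsymbol\Sigma^{-1}(\bar{\mathbf{v}} - \boldsymbol\mu)]/2}.
\]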


S 


103 peo = (* ie NC ) 
0 D, Tyx Ry 0 D, 
Sy SyTyx sy 0 
~ ea neal a) 
= ( SyKyDx 
sD), D,RyD, 


10.4 Express y and w in terms of (2) as follows: y = (1,0,...,0) 


) = W(2).w= 0.0,.25)(2 4 constant = v(2)+ constant. 
x x » x x 


Then use (3.42) and (3.43) with & partitioned as in (10.3). 


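10.5 Express w and y as $w = (0, \mathbf{a}')\binom{y}{\mathbf{x}}$ and $y = (1, 0, \ldots, 0)\binom{y}{\mathbf{x}}$. Then

\[
\mathrm{cov}(y, w) = \mathrm{cov}\left[(1, 0, \ldots, 0)\binom{y}{\mathbf{x}},\ (0, \mathbf{a}')\binom{y}{\mathbf{x}}\right]
= (1, \mathbf{0}')\begin{pmatrix} \sigma_{yy} & \boldsymbol\sigma_{yx}' \\ \boldsymbol\sigma_{yx} & \boldsymbol\Sigma_{xx} \end{pmatrix}\binom{0}{\mathbf{a}}
= (\sigma_{yy}, \boldsymbol\sigma_{yx}')\binom{0}{\mathbf{a}} = \boldsymbol\sigma_{yx}'\mathbf{a},
\]

\[
\rho_{yw}^2 = \frac{[\mathrm{cov}(y, w)]^2}{\mathrm{var}(y)\,\mathrm{var}(w)} = \frac{(\mathbf{a}'\boldsymbol\sigma_{yx})^2}{\sigma_{yy}(\mathbf{a}'\boldsymbol\Sigma_{xx}\mathbf{a})}. \qquad (1)
\]

Differentiate $\rho_{yw}^2$ with respect to $\mathbf{a}$ and set the result equal to $\mathbf{0}$ to obtain
$\mathbf{a} = (\mathbf{a}'\boldsymbol\Sigma_{xx}\mathbf{a}/\mathbf{a}'\boldsymbol\sigma_{yx})\boldsymbol\Sigma_{xx}^{-1}\boldsymbol\sigma_{yx}$, which can be substituted into (1) to obtain

\[
\max_{\mathbf{a}}\rho_{yw}^2 = \boldsymbol\sigma_{yx}'\boldsymbol\Sigma_{xx}^{-1}\boldsymbol\sigma_{yx}/\sigma_{yy}.
\]

10.6 Show that for $\boldsymbol\Sigma$ partitioned as in (10.3), (2.75) becomes $|\boldsymbol\Sigma| = |\boldsymbol\Sigma_{xx}|(\sigma_{yy} - \boldsymbol\sigma_{yx}'\boldsymbol\Sigma_{xx}^{-1}\boldsymbol\sigma_{yx})$.
Solve for $\boldsymbol\sigma_{yx}'\boldsymbol\Sigma_{xx}^{-1}\boldsymbol\sigma_{yx}$ and substitute into (10.27).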


10.7 y y 
a, = cov(u, Vv) = cov(ay, Bx) = cov a0, Shs o( ie (0, B( )| 
x x 


eo.) (0 
= asite Pee = ao',B’ 
(a, 0, ,0) ( Oy Dex ( B’ ) a yx > 


wy = cov(Bx) = BY,,B’, 
Cie =A Oop 


ae 0,30! 6y _ 401,,B'(BE,.B') Boy, 
ulv 


Cun a Oyy 
—lw-—lp-— 
_ @o1,B(B)'S,,'B Bo, 


2 
a? Oyy 




10.8 y-w=y- by - 3. (x = H,) 
= (1, —o1,.2 5) ( ”) + constant 
x 
=a oO) + constant, 
x 
s-00(2)-#(2) 
x x 
cov(y — w, x) = a’ SB’ = (1, | oy >) (| ) 
Cy. Bee FMT 
0’ 
= ( — O25 Opn, Fy — 0,2 % x) ( : ) 
=. 
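10.9 (a) By definition,
\[
r_{y\hat{y}}^2 = \frac{\left[\sum_{i=1}^{n}(y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})\right]^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2\,\sum_{i=1}^{n}(\hat{y}_i - \bar{\hat{y}})^2}.
\]
Show that $\bar{\hat{y}} = \bar{y}$ by using $\mathbf{X}' = \binom{\mathbf{j}'}{\mathbf{X}_1'}$ in $\mathbf{X}'\mathbf{X}\hat{\boldsymbol\beta} = \mathbf{X}'\mathbf{y}$
to obtain $\mathbf{j}'\mathbf{X}\hat{\boldsymbol\beta} = \mathbf{j}'\mathbf{y}$, from which $\sum_{i=1}^{n}\hat{y}_i = \sum_{i=1}^{n}y_i$. Show
that $\sum_{i=1}^{n}(y_i - \bar{y})(\hat{y}_i - \bar{y}) = \sum_{i=1}^{n}y_i\hat{y}_i - n\bar{y}^2 = \mathbf{y}'\hat{\mathbf{y}} - n\bar{y}^2 = \mathbf{y}'\mathbf{X}\hat{\boldsymbol\beta} - n\bar{y}^2$.
Show that $\sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 = \sum_{i=1}^{n}\hat{y}_i^2 - n\bar{y}^2 = \hat{\mathbf{y}}'\hat{\mathbf{y}} - n\bar{y}^2 = \hat{\boldsymbol\beta}'\mathbf{X}'\mathbf{X}\hat{\boldsymbol\beta} - n\bar{y}^2 =
\hat{\boldsymbol\beta}'\mathbf{X}'\mathbf{y} - n\bar{y}^2$. Use (7.54).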

(b) This follows directly from estimation of (10.25) and the expression 
following (10.26): 


y,a’x 


10.10 r a (Syax)” /so52.,. Express y and a’x as y= (1,0,....0(2) and 


a’x = (0, n(2 i By analogy with (3.40), show that syax = 


(1,0,..., os(?) = S.A, where S is partitioned as in (10.10). Similarly, 


by analogy with (3.42) show that s2,=a'S,a. Solve 


(0/ da)(s\,.ay° /s,al S,,a = 0 for a and substitute back into Fal above to 


x 


show that mMaXa/’, ax = R?. 


10.11 Substitute (10.20) and (10.21) into (10.34). 


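10.12 Adapt (2.51) to obtain

\[
\begin{pmatrix} a & \mathbf{b}' \\ \mathbf{b} & \mathbf{C} \end{pmatrix}^{-1}
= \begin{pmatrix} 1/d & -\mathbf{b}'\mathbf{C}^{-1}/d \\ -\mathbf{C}^{-1}\mathbf{b}/d & \mathbf{C}^{-1} + \mathbf{C}^{-1}\mathbf{b}\mathbf{b}'\mathbf{C}^{-1}/d \end{pmatrix}, \qquad (1)
\]

where $\mathbf{C}$ is symmetric and $d = a - \mathbf{b}'\mathbf{C}^{-1}\mathbf{b}$. Apply (1) to $\mathbf{R}$ partitioned as in
(10.18),

\[
\mathbf{R} = \begin{pmatrix} 1 & \mathbf{r}_{yx}' \\ \mathbf{r}_{yx} & \mathbf{R}_{xx} \end{pmatrix}.
\]

Then

\[
r^{yy} = \frac{1}{d} = \frac{1}{a - \mathbf{b}'\mathbf{C}^{-1}\mathbf{b}} = \frac{1}{1 - \mathbf{r}_{yx}'\mathbf{R}_{xx}^{-1}\mathbf{r}_{yx}} = \frac{1}{1 - R^2}.
\]

10.13 Use (2.71) to show that for $\mathbf{S}$ partitioned as in (10.14), (2.75) can be adapted
to the form $|\mathbf{S}| = |\mathbf{S}_{xx}|(s_{yy} - \mathbf{s}_{yx}'\mathbf{S}_{xx}^{-1}\mathbf{s}_{yx})$. Solve for $\mathbf{s}_{yx}'\mathbf{S}_{xx}^{-1}\mathbf{s}_{yx}$, and substitute
into (10.34).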
10.14 As in Problem 10.4, define «= ay and v= Bx, so that o/,, = ao',,B’, 
Yw = BY..B’, and oO, = a? yy. Then by Theorem 10.2c, the maximum 
likelihood estimators of these are s/,, = as,,.B’,S,, = BS,,B’, and 


Sic WS, respectively. Substitute these into 


R- = 8/9) Suv 
uv Suu 7 
10.15 , 1 S70, i /2 
L(ft, X0) = e fv y . 
(V2my" (S|? 
Using v; = el p in (10.9), amd Xp in (10.45), show that L(jt, Xo) 
becomes : 
1 = rt _—y)2 /2 
L(jt, Xo) = - nla’ dint OW /20%y 
(V2my" oh 
St av D yaw, 


The first factor is maximized by G,,, and the second factor by Dd: Show that 
when these are substituted in L(t, Xo), the result is given by (10.47). 


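10.16
\[
\hat{\boldsymbol\Sigma} = \frac{n-1}{n}\mathbf{S}, \qquad
|\hat{\boldsymbol\Sigma}| = \left(\frac{n-1}{n}\right)^{k+1}|\mathbf{S}|, \qquad
|\hat{\boldsymbol\Sigma}_{xx}| = \left(\frac{n-1}{n}\right)^{k}|\mathbf{S}_{xx}|.
\]

10.17 Multiply by $1/\sqrt{n - 3}$, subtract z, multiply by -1 (which reverses the direction
of the inequalities), then take tanh (hyperbolic tangent) of all three members.

10.18 (a)
\[
\mathbf{V} = \begin{pmatrix} \dfrac{1}{n_1 - 3} & 0 & 0 \\ 0 & \dfrac{1}{n_2 - 3} & 0 \\ 0 & 0 & \dfrac{1}{n_3 - 3} \end{pmatrix}
\]
(b) Using Theorem 4.4a(ii) and (5.35), $[\mathbf{C}(\mathbf{z} - \boldsymbol\mu_z)]'[\mathbf{C}\mathbf{V}\mathbf{C}']^{-1}[\mathbf{C}(\mathbf{z} - \boldsymbol\mu_z)]$ is $\chi^2(2)$.
(c) Calculate $u = \mathbf{z}'\mathbf{C}'[\mathbf{C}\mathbf{V}\mathbf{C}']^{-1}\mathbf{C}\mathbf{z}$. Reject if $u > \chi^2_{\alpha,2}$.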


10.19 The sample covariance matrix involving y and w can be expressed in the 


form 
2 / 
S= So Sie 
Syw Siw 


and s,,, and S,,,, can be further partitioned as 


Syx Sx. Ss 
Sy = é ) and Swy = ( a ) : (1) 
yz 2X z 


By (10.34), the squared multiple correlation of y regressed on w can be 
written as 


rel 
R _ Sy SivwSyw 9) 
yw 2 . ( ) 
Sy 


Using (2.51) for the inverse of the partitioned matrix S,,,, in (1), show that 


1 

/ el a 2 / el ! gl i el 

Sy SsrwSyw as 2 (S7.,8),Syy Syx a 8).Syx S2x8_,S Syx 
Bx 


rel / gl 2 
— SyS2Sxx Syx — Syz8y,Syy Sex + 552) 


1 x 
2 .2p2 2 
- 2 [s.r + (Bi .Syx a Syz) ? 
ax 


2 2 e7¢1 RB —~ole ; : 
where sz. =s7 —S_S. 52, and B,, = S\ Sx is the vector of regression 


10.20 


10.21 


10.22 


10.23 


10.24 




coefficients of z regressed on the x’s. Then show that (2) becomes 


; 2 
2, (BySyx — Sy) 


R= R2.+ (3) 
yw ye sese(1 — R2.) 


Simplify (3) to the correlation form shown in (10.58). 


If z is orthogonal to the x’s, then s,=0. Show that this leads to 
Fyy = Oand R°. = 0. 


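For a linear function $\beta_0 + \boldsymbol\beta_1'\mathbf{x}$, the mean squared error is given by
$m = E(y - \beta_0 - \boldsymbol\beta_1'\mathbf{x})^2$. Adding and subtracting $\mu_y$ and $\boldsymbol\beta_1'\boldsymbol\mu_x$ leads to

\[
m = E[(y - \mu_y) - (\beta_0 - \mu_y + \boldsymbol\beta_1'\boldsymbol\mu_x) - \boldsymbol\beta_1'(\mathbf{x} - \boldsymbol\mu_x)]^2.
\]

Show that this becomes

\[
m = \sigma_{yy} + (\beta_0 - \mu_y + \boldsymbol\beta_1'\boldsymbol\mu_x)^2 + \boldsymbol\beta_1'\boldsymbol\Sigma_{xx}\boldsymbol\beta_1 - 2\boldsymbol\beta_1'\boldsymbol\sigma_{yx}.
\]

Differentiate m with respect to $\beta_0$ and with respect to $\boldsymbol\beta_1$ and set the results
equal to zero.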


Follow the steps in the answer to Problem 10.19 using y,x,S,., and s,, in 
place of fy, Mr, Xxx, and o,, and using a sample mean in place of 
expectation. 


From the expression preceding (10.71), we obtain 


So wai = S> O11 — F102 — ¥2) — Bro 9 1 — 103i — Js) 
=I i i 
— Bu 2 (y3i — ¥3)027 — Fo) + Bur Br2 3 (y3i — J3)°- 
Using (10.67) and (10.68), this becomes 
Yo wwe = S25 Ou — YO — Fa) — BrBu D> Ow — W3P 
=I i i 
— Bi By De (31 — 3 + Bi Bir 3 (y3i — ¥3)°. 


Follow the steps in the answer to Problem 10.23. 




10.25 Denote yi — ¥, by y;,,k = 1, 2,3. Then by (10.76), we obtain 


ii — BiB. Dit 
J (Six? — BA Di9%) (Dis? - BB Div) 


Mwiw. = 


Substituting for B,, and > from (10.67) and (10.68), we have 


Divina Ge) Ge) Cow) 


Twiw. = 


Da = (Ree) (iy) 5 a = os ‘) (iii q | 


Dividing numerator and denominator by \/>°; yi? )¢; y37, we obtain 


ip whi ei Vi 
if VE oe al SP me ee 


Twiyw, = 
SE. Ope | |SSek Oop” 
de) Dada eel (Due 248 Lee 


12 — 11363 


~ Jara —) 


10.26 By (10.78), 


» ly; — 9,00] = Sy, —y — SS. '(x; — 9] 


i=l 
= D9) - SSa' Di — 9) 


=0-S,,S,'0. 




10.27 By definition, the partitioned S can be written as 


_ Sy Syx 1 n y; ; 7 : 
(2 5) S210)-OlC)-Ol 
a=! Vid Vi e= ¥ 
eee ed 

= (2 to - nee - 90 


__! se | ¥i-YO-Y YO - NG — | 


— n— 11x — ¥¥, YY (K — DOK — 
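10.28 (a)
\[
\mathbf{S}_{xx} = \begin{pmatrix} 271.9298 & 10.0830 & 1.4011 \\ 10.0830 & 1.4954 & 0.0514 \\ 1.4011 & 0.0514 & 0.0074 \end{pmatrix}, \quad
\mathbf{s}_{yx} = \begin{pmatrix} .2204 \\ .0220 \\ .0017 \end{pmatrix}, \quad
\hat{\boldsymbol\beta}_1 = \begin{pmatrix} -0.0212 \\ 0.0143 \\ 4.1781 \end{pmatrix}, \quad
\hat\beta_0 = .2659, \quad s^2 = .004978
\]
(b)
\[
\mathbf{R}_{xx} = \begin{pmatrix} 1.000 & 0.500 & 0.990 \\ 0.500 & 1.000 & 0.490 \\ 0.990 & 0.490 & 1.000 \end{pmatrix}, \quad
\mathbf{r}_{yx} = \begin{pmatrix} .151 \\ .203 \\ .328 \end{pmatrix}, \quad
\hat{\boldsymbol\beta}_1^{*} = \begin{pmatrix} -3.960 \\ 0.198 \\ 4.052 \end{pmatrix}
\]
(c) $R^2 = .3639$
(d)
\[ F = \frac{R^2/3}{(1 - R^2)/15} = 2.8604, \quad p = .072 \]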

10.29 (a) $x_2$: $r_1 = .5966$, $r_2 = .0721$, $z_1 = .6878$, $z_2 = .0722$, $v = 2.0642$, limits for
$\rho_1$ are .2721 and .7992, limits for $\rho_2$ are -.3325 and .4543.
(b) $x_3$: $r_1 = .7012$, $r_2 = .8209$, $z_1 = .8697$, $z_2 = 1.1594$, $v = -.9716$, limits
for $\rho_1$ are .4309 and .8561, limits for $\rho_2$ are .6301 and .9182.
(c) $x_4$: $r_1 = .0400$, $r_2 = .0714$, $z_1 = .04002$, $z_2 = .07154$, $v = -.1057$,
limits for $\rho_1$ are -.3528 and .4208, limits for $\rho_2$ are -.3331 and .4537.
(d) $x_5$: $r_1 = .3391$, $r_2 = .2683$, $z_1 = .3531$, $z_2 = .2751$, $v = .2617$, limits for
$\rho_1$ are -.0555 and .6421, limits for $\rho_2$ are -.1418 and .5999.
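The steps described in 10.17 (scale by $1/\sqrt{n-3}$, shift by z, then apply tanh) can be sketched in a few lines and compared with the numbers printed for part (a) of 10.29. The sample sizes are not stated in this section; the values $n_1 = 26$ and $n_2 = 25$ below are inferred from the widths of the printed limits and should be treated as an assumption, as should the use of the 1.96 normal quantile.

```python
import math

def fisher_ci(r, n, z_crit=1.96):
    """Confidence limits for rho: tanh(z -/+ z_crit / sqrt(n - 3)), with z = atanh(r)."""
    z = math.atanh(r)
    half = z_crit / math.sqrt(n - 3)
    return math.tanh(z - half), math.tanh(z + half)

def compare_z(r1, n1, r2, n2):
    """Statistic for comparing two independent correlations on the z scale."""
    return (math.atanh(r1) - math.atanh(r2)) / math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))

r1, r2 = 0.5966, 0.0721      # printed in 10.29(a)
n1, n2 = 26, 25              # inferred, not stated in the source

lo1, hi1 = fisher_ci(r1, n1)
lo2, hi2 = fisher_ci(r2, n2)
v = compare_z(r1, n1, r2, n2)
print(round(lo1, 4), round(hi1, 4), round(lo2, 4), round(hi2, 4), round(v, 4))
```

With these assumed sample sizes the output agrees with the printed limits and with v to about three decimals.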




z Ay. Nyz Ry, oe ak Re F p Value 
10.30 

Bal 221. 0228 .9808 3010 7.099 .018 

Xo .055 0413 .2513 .0292 .689 417 

x3 149 0518 .9806 .3193 7.531 O15 
10.31 


(a) To find 7,1.23, the sample covariance matrix is partitioned as 


.0078 .2204 0220 .0017 

| .2204 ~—- 271.9298 | 10.0830 1.4011 te =) 
.0220 10.0830 | 1.4954 0514 ; 
.0017 1.4011 0514  .0074 


where y = (y, x,)’ and x = (x2, x3)’. From §,,, §,., S,,, and S,,, we obtain 
D.= .0856 0 R.= 1.000  —.567 
a 0 2.2846 J]? ~*~ \ —.567 1.000 /” 
Thus r,1.23 = —.567, as compared to ry; = .151. 


(b) For ry2.13, we have y = (y,.x2)' and x = (1%4,%3)', 


.0078 0.0220 0 1.0581 
Sy = > D, = > 


0220 1.4954 0722 0 
7 ( 1.000 ste 
** ~~ \.0.2097. 1.000 /' 


(c) For R,, corresponding to y = (y, x1, x2) and x = x3, we have 


0078 = -.2204-—S 0220 
Sy, = | .2204 271.9298 10.0830 |, 
0220 10.0830 1.4954 


.0861 0 0 
D; = 0 2.3015 0 
0 0 1.0660 
1.000 —0.546 0.108 
R,., = | —0.546 1.000 0.121 


1.108 0.121 1.000 


Chapter 11 


11.1 (B— )'V'(B— #)+(y—XB)(y— XB) + 6. 

= BV 'B-2B'V'6+V 'b+y'y—26'X'y + BX’'XB+ 4, 

= B(V-!+X’X)B— 2p6'(V-! 4 X’xXy(V_! + X’X) (Vb + X’y) 
+(V-'b+X’yy (V1 + X’X) (V1 + X’Xy(V + X’X)! 
x (Vo b+ X’y)— (Vb + Xyy(VE +E X’X) (Wo! + X’X) 
(V'1+X’X) (V1 h4+X y+ dV 'bt+yy+6, 

= BV.'B-2BV.'6,+ 6Vi',- Ni", 

+P'V'b+yy+6, 
= (B— b'V.'(B— b,) + bes. 


11.2
\[
\int_0^\infty t^{a}e^{-bt}\,dt = b^{-a}\int_0^\infty (bt)^{a}e^{-bt}\,dt
= b^{-(a+1)}\int_0^\infty s^{a}e^{-s}\,ds \quad (\text{letting } s = bt)
= b^{-(a+1)}\Gamma(a + 1) \quad [\text{by definition of } \Gamma(a + 1)].
\]
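The identity in 11.2 is used repeatedly in the later answers, so a quick numerical check may be useful. The sketch below compares a midpoint-rule approximation of the integral with the closed form $b^{-(a+1)}\Gamma(a+1)$; the values of a and b, the truncation point, and the step count are arbitrary illustrative choices.

```python
import math

def gamma_integral_numeric(a, b, steps=200_000):
    """Midpoint-rule approximation of the integral of t^a * exp(-b t) over (0, inf)."""
    upper = 60.0 / b                      # integrand is negligible beyond this point
    h = upper / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * h
        total += t ** a * math.exp(-b * t)
    return h * total

def gamma_integral_closed(a, b):
    """Closed form from 11.2: Gamma(a + 1) / b^(a + 1)."""
    return math.gamma(a + 1) / b ** (a + 1)

a, b = 2.5, 1.7
numeric = gamma_integral_numeric(a, b)
closed = gamma_integral_closed(a, b)
print(numeric, closed)
```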


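11.3 (a) Use (2.54) with $\mathbf{A} = \mathbf{I}$, $\mathbf{P} = \mathbf{X}\mathbf{V}$, $\mathbf{B} = \mathbf{V}^{-1}$, and $\mathbf{Q} = \mathbf{V}\mathbf{X}'$.

(b)
\[
\begin{aligned}
(\mathbf{I} + \mathbf{X}\mathbf{V}\mathbf{X}')^{-1}\mathbf{X} - \mathbf{X}(\mathbf{X}'\mathbf{X} + \mathbf{V}^{-1})^{-1}\mathbf{V}^{-1}
&= [\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X} + \mathbf{V}^{-1})^{-1}\mathbf{X}']\mathbf{X} - \mathbf{X}(\mathbf{X}'\mathbf{X} + \mathbf{V}^{-1})^{-1}\mathbf{V}^{-1} && \text{(Problem 11.3a)} \\
&= \mathbf{X} - \mathbf{X}(\mathbf{X}'\mathbf{X} + \mathbf{V}^{-1})^{-1}\mathbf{X}'\mathbf{X} - \mathbf{X}(\mathbf{X}'\mathbf{X} + \mathbf{V}^{-1})^{-1}\mathbf{V}^{-1} \\
&= \mathbf{X} - \mathbf{X}(\mathbf{X}'\mathbf{X} + \mathbf{V}^{-1})^{-1}(\mathbf{X}'\mathbf{X} + \mathbf{V}^{-1}) \\
&= \mathbf{O}.
\end{aligned}
\]

(c)
\[
\begin{aligned}
\mathbf{V}^{-1} - \mathbf{V}^{-1}(\mathbf{X}'\mathbf{X} + \mathbf{V}^{-1})^{-1}\mathbf{V}^{-1}
&= [\mathbf{V} + (\mathbf{X}'\mathbf{X})^{-1}(\mathbf{X}'\mathbf{X})(\mathbf{X}'\mathbf{X})^{-1}]^{-1} && \text{[use (2.54) in reverse]} \\
&= [(\mathbf{X}'\mathbf{X})^{-1} + \mathbf{V}]^{-1} && \text{(simplify)} \\
&= \mathbf{X}'\mathbf{X} - \mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X} + \mathbf{V}^{-1})^{-1}\mathbf{X}'\mathbf{X} && \text{[use (2.54)]} \\
&= \mathbf{X}'[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X} + \mathbf{V}^{-1})^{-1}\mathbf{X}']\mathbf{X} && \text{(factor)} \\
&= \mathbf{X}'(\mathbf{I} + \mathbf{X}\mathbf{V}\mathbf{X}')^{-1}\mathbf{X} && \text{(Problem 11.3a).}
\end{aligned}
\]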


114 yy+¢V'b-V_'6, 

= (y — Xo)'(1+ XVX’)(y — X) — (y — Xp) 
x I+ XVX') l(y-Xo)+yy+ b'V'b— dV" 6, 

= (y — X@)'(1+ XVX’) |(y — Xd) — y+ XVX’) ly + 2y'(d+ XVx’) ! 
x Xo— f'X'(1+ XVX') 'Xh+yy+¢V'd-(X'y+V'oy 
x X/X+ V1) 1X X4 VIX’ X 4 VO) X’y t+ Vp) 

= (y —X@)\(1+ XVX') l(y— Xd) + y'[I— (1+ XVXx’)! 
— X(X’X + V-!) I X’]y + 2y' [0 + XVX’) 1X — X(X’X4+V'>'Vv lb 
+ @[V'—x’'d4+ Xvx’)) 'X—-Vvl(X’/X+ V1) !v- ob 

= (y — X@)\(1+ XVX’)|(y — Xd) + y'Oy + 2y'06 + H'Od 
(see Problems 11.3a, b, and c) 


= (y — X@)(1+ XVX’) '(y — X@). 


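11.5 The prior density for $\boldsymbol\beta$ is $p_1(\boldsymbol\beta) = c_1$. Since the prior density for
$\ln(\tau^{-1})$ is uniform, the prior density for $\tau$ is $p_2(\tau) = c_2\tau^{-1}$. The likelihood
for $\mathbf{y} \mid \boldsymbol\beta, \tau$ is the multivariate normal density with mean $\mathbf{X}\boldsymbol\beta$
and covariance matrix $\tau^{-1}\mathbf{I}$. Using Bayes theorem in (11.4), the joint
posterior density is

\[
g(\boldsymbol\beta, \tau \mid \mathbf{y}) = c_4 c_1 c_2\tau^{-1} c_3\tau^{n/2}e^{-\tau(\mathbf{y} - \mathbf{X}\boldsymbol\beta)'(\mathbf{y} - \mathbf{X}\boldsymbol\beta)/2}
= c_5\tau^{(n-2)/2}e^{-\tau(\mathbf{y} - \mathbf{X}\boldsymbol\beta)'(\mathbf{y} - \mathbf{X}\boldsymbol\beta)/2}.
\]

The marginal posterior density of $\boldsymbol\beta \mid \mathbf{y}$ is

\[
\begin{aligned}
u(\boldsymbol\beta \mid \mathbf{y}) &= c_5\int_0^\infty \tau^{(n-2)/2}e^{-\tau(\mathbf{y} - \mathbf{X}\boldsymbol\beta)'(\mathbf{y} - \mathbf{X}\boldsymbol\beta)/2}\,d\tau \\
&= c_5\Gamma(n/2)[(\mathbf{y} - \mathbf{X}\boldsymbol\beta)'(\mathbf{y} - \mathbf{X}\boldsymbol\beta)/2]^{-n/2} && \text{(Problem 11.2)} \\
&= c_6[(n - k - 1)s^2 + (\boldsymbol\beta - \hat{\boldsymbol\beta})'(\mathbf{X}'\mathbf{X})(\boldsymbol\beta - \hat{\boldsymbol\beta})]^{-n/2} && \text{(proof of Theorem 7.6c)} \\
&= c_7[1 + (\boldsymbol\beta - \hat{\boldsymbol\beta})'\mathbf{X}'\mathbf{X}(\boldsymbol\beta - \hat{\boldsymbol\beta})/(n - k - 1)s^2]^{-n/2} \\
&= c_7\{1 + (\boldsymbol\beta - \hat{\boldsymbol\beta})'[s^2(\mathbf{X}'\mathbf{X})^{-1}]^{-1}(\boldsymbol\beta - \hat{\boldsymbol\beta})/(n - k - 1)\}^{-[(n-k-1)+(k+1)]/2},
\end{aligned}
\]

which is the density function of the multivariate t-distribution (Gelman
et al. 2004, pp. 576-577) with parameters $(n - k - 1, \hat{\boldsymbol\beta}, s^2(\mathbf{X}'\mathbf{X})^{-1})$.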


11.6 Using (7.64), the generalized least squares estimate of 8 for the augmented 
data is 
G9 @ OGG 
I Oo V I I Oo V 00) 
I 0\7/x\]~ I 0\'/y 
w (6 v) (y)P & Mo v) G) 
O V Y O V 0) 


=(XX+V!) '!X’y+V'¢). 


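11.7
\[
\begin{aligned}
E(\tau) &= \int_0^\infty \tau\,\frac{\delta^{\alpha}}{\Gamma(\alpha)}\tau^{\alpha-1}e^{-\delta\tau}\,d\tau
= \frac{\delta^{\alpha}}{\Gamma(\alpha)}\int_0^\infty \tau^{\alpha}e^{-\delta\tau}\,d\tau \\
&= \frac{\delta^{\alpha}}{\Gamma(\alpha)}\,\delta^{-(\alpha+1)}\Gamma(\alpha + 1) && \text{[using Problem 11.2]} \\
&= \frac{\alpha}{\delta},
\end{aligned}
\]
\[
\begin{aligned}
\mathrm{var}(\tau) &= \int_0^\infty \tau^2\,\frac{\delta^{\alpha}}{\Gamma(\alpha)}\tau^{\alpha-1}e^{-\delta\tau}\,d\tau - \left(\frac{\alpha}{\delta}\right)^2
= \frac{\delta^{\alpha}}{\Gamma(\alpha)}\int_0^\infty \tau^{\alpha+1}e^{-\delta\tau}\,d\tau - \left(\frac{\alpha}{\delta}\right)^2 \\
&= \frac{\delta^{\alpha}}{\Gamma(\alpha)}\,\delta^{-(\alpha+2)}\Gamma(\alpha + 2) - \left(\frac{\alpha}{\delta}\right)^2
= \frac{(\alpha + 1)\alpha}{\delta^2} - \frac{\alpha^2}{\delta^2}
= \frac{\alpha}{\delta^2}.
\end{aligned}
\]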


11.8 The density function of zly is given in (11.14). Using the change-of-variable 
technique, the marginal posterior density of oly is 


WO Ly) = C502) OTD eC ENE PAG Hty' y+ 26)/21(07) g2)-2 


= C6(02) PDN NEGA BAGV G+Y 94 28)/2)/o 


11.9 (a) This is the model of Section 11.2.1, with k=1,@= (i): 


oe «(OO . 
0 
aad v= (% ): 


Using Theorem 11.2b and the expression in (11.18), 8,|7 is t-distributed 
with parameters n + 2a, @,5, and w32 where 


Crm +n) dU xii — Di DXi 


Pa = (op? + n\(o7? + 5,7) — OD, x0” 




and 


y'(l— XVX')-ly + 28 (o" +7) 
n+2a (a9? + n)(o9? + 30,97) — (Di)? 


W322 = 


(b) A point estimate is given by dy. and a (1 — w) x 100% confidence interval 
is given by @,) + ta/2n420+22- 


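11.10 (a) The joint prior density is

\[
p(\boldsymbol\beta, \tau) = p_1(\boldsymbol\beta \mid \tau)\,p_2(\tau)
= c_1 e^{-(\boldsymbol\beta - \boldsymbol\phi)'\mathbf{V}^{-1}(\boldsymbol\beta - \boldsymbol\phi)/2}\,\tau^{\alpha-1}e^{-\delta\tau}
= c_1\tau^{\alpha-1}e^{-(\boldsymbol\beta - \boldsymbol\phi)'\mathbf{V}^{-1}(\boldsymbol\beta - \boldsymbol\phi)/2 - \delta\tau}.
\]

Using (11.4), the joint posterior density is

\[
\begin{aligned}
g(\boldsymbol\beta, \tau \mid \mathbf{y}) &= c\,p(\boldsymbol\beta, \tau)\,L(\boldsymbol\beta, \tau \mid \mathbf{y}) \\
&= c_2\tau^{\alpha-1}e^{-(\boldsymbol\beta - \boldsymbol\phi)'\mathbf{V}^{-1}(\boldsymbol\beta - \boldsymbol\phi)/2 - \delta\tau}\,\tau^{n/2}e^{-\tau(\mathbf{y} - \mathbf{X}\boldsymbol\beta)'(\mathbf{y} - \mathbf{X}\boldsymbol\beta)/2} \\
&= c_2\tau^{n/2+\alpha-1}e^{-[(\boldsymbol\beta - \boldsymbol\phi)'\mathbf{V}^{-1}(\boldsymbol\beta - \boldsymbol\phi) + \tau(\mathbf{y} - \mathbf{X}\boldsymbol\beta)'(\mathbf{y} - \mathbf{X}\boldsymbol\beta)]/2 - \delta\tau}.
\end{aligned}
\]

(b) Picking the terms out of the joint density $g(\boldsymbol\beta, \tau \mid \mathbf{y})$ that involve $\boldsymbol\beta$, and
considering everything else to be part of the normalizing constant, the
conditional posterior density of $\boldsymbol\beta \mid \tau, \mathbf{y}$ is

\[
\begin{aligned}
g(\boldsymbol\beta \mid \tau, \mathbf{y}) &= c_3 e^{-[(\boldsymbol\beta - \boldsymbol\phi)'\mathbf{V}^{-1}(\boldsymbol\beta - \boldsymbol\phi) + \tau(\mathbf{y} - \mathbf{X}\boldsymbol\beta)'(\mathbf{y} - \mathbf{X}\boldsymbol\beta)]/2} \\
&= c_3 e^{-\tau[\tau^{-1}\boldsymbol\beta'\mathbf{V}^{-1}\boldsymbol\beta - 2\tau^{-1}\boldsymbol\beta'\mathbf{V}^{-1}\boldsymbol\phi + \tau^{-1}\boldsymbol\phi'\mathbf{V}^{-1}\boldsymbol\phi + \mathbf{y}'\mathbf{y} - 2\boldsymbol\beta'\mathbf{X}'\mathbf{y} + \boldsymbol\beta'\mathbf{X}'\mathbf{X}\boldsymbol\beta]/2} \\
&= c_4 e^{-\tau[\boldsymbol\beta'(\mathbf{X}'\mathbf{X} + \tau^{-1}\mathbf{V}^{-1})\boldsymbol\beta - 2\boldsymbol\beta'(\mathbf{X}'\mathbf{y} + \tau^{-1}\mathbf{V}^{-1}\boldsymbol\phi)]/2} \\
&= c_4 e^{-\tau[\boldsymbol\beta'(\mathbf{X}'\mathbf{X} + \tau^{-1}\mathbf{V}^{-1})\boldsymbol\beta - 2\boldsymbol\beta'(\mathbf{X}'\mathbf{X} + \tau^{-1}\mathbf{V}^{-1})(\mathbf{X}'\mathbf{X} + \tau^{-1}\mathbf{V}^{-1})^{-1}(\mathbf{X}'\mathbf{y} + \tau^{-1}\mathbf{V}^{-1}\boldsymbol\phi)]/2} \\
&= c_5 e^{-\tau[\boldsymbol\beta'\mathbf{V}_*^{-1}\boldsymbol\beta - 2\boldsymbol\beta'\mathbf{V}_*^{-1}\boldsymbol\phi_* + \boldsymbol\phi_*'\mathbf{V}_*^{-1}\boldsymbol\phi_*]/2} \\
&\qquad \text{where } \mathbf{V}_* = (\mathbf{X}'\mathbf{X} + \tau^{-1}\mathbf{V}^{-1})^{-1} \text{ and } \boldsymbol\phi_* = \mathbf{V}_*(\mathbf{X}'\mathbf{y} + \tau^{-1}\mathbf{V}^{-1}\boldsymbol\phi) \\
&= c_5 e^{-\tau[(\boldsymbol\beta - \boldsymbol\phi_*)'\mathbf{V}_*^{-1}(\boldsymbol\beta - \boldsymbol\phi_*)]/2}.
\end{aligned}
\]

Hence $\boldsymbol\beta \mid \tau, \mathbf{y}$ is $N_{k+1}(\boldsymbol\phi_*, \tau^{-1}\mathbf{V}_*)$.

(c) Picking the terms out of the joint density $g(\boldsymbol\beta, \tau \mid \mathbf{y})$ that involve $\tau$, and
considering everything else to be part of the normalizing constant, the
conditional posterior density of $\tau \mid \boldsymbol\beta, \mathbf{y}$ is

\[
w(\tau \mid \boldsymbol\beta, \mathbf{y}) = c_6\tau^{n/2+\alpha-1}e^{-\tau(\mathbf{y} - \mathbf{X}\boldsymbol\beta)'(\mathbf{y} - \mathbf{X}\boldsymbol\beta)/2 - \delta\tau}
= c_6\tau^{n/2+\alpha-1}e^{-[(\mathbf{y} - \mathbf{X}\boldsymbol\beta)'(\mathbf{y} - \mathbf{X}\boldsymbol\beta)/2 + \delta]\tau}.
\]

Hence $\tau \mid \boldsymbol\beta, \mathbf{y}$ is Gamma$[n/2 + \alpha, (\mathbf{y} - \mathbf{X}\boldsymbol\beta)'(\mathbf{y} - \mathbf{X}\boldsymbol\beta)/2 + \delta]$.

(d)
  • Specify $1/s^2$ from (7.23) as a starting value $\tau_0$.
  • For i = 1 to M:
      calculate $\mathbf{V}_{*,i-1} = (\mathbf{X}'\mathbf{X} + \tau_{i-1}^{-1}\mathbf{V}^{-1})^{-1}$,
      calculate $\boldsymbol\phi_{*,i-1} = \mathbf{V}_{*,i-1}(\mathbf{X}'\mathbf{y} + \tau_{i-1}^{-1}\mathbf{V}^{-1}\boldsymbol\phi)$,
      draw $\boldsymbol\beta_i$ from $N_{k+1}(\boldsymbol\phi_{*,i-1}, \tau_{i-1}^{-1}\mathbf{V}_{*,i-1})$,
      draw $\tau_i$ from Gamma$[n/2 + \alpha, (\mathbf{y} - \mathbf{X}\boldsymbol\beta_i)'(\mathbf{y} - \mathbf{X}\boldsymbol\beta_i)/2 + \delta]$,
      calculate $\tau_i^{-1}$.
  • Consider all draws $(\boldsymbol\beta_i, \tau_i^{-1})$ to be from the joint posterior distribution.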
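The sampler described in part (d) can be sketched in plain Python for the special case k = 1 (intercept and one slope), where all matrices are 2 x 2. Everything about the data, the prior settings ($\boldsymbol\phi = \mathbf{0}$, $\mathbf{V}^{-1} = .01\mathbf{I}$, $\alpha = \delta = .01$), the number of draws, and the discarded initial draws is a made-up illustration, not part of the original answer.

```python
import math
import random

def inv2(m):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def draw_mvn2(mean, cov, rng):
    """Draw from a bivariate normal using the Cholesky factor of cov."""
    l11 = math.sqrt(cov[0][0])
    l21 = cov[1][0] / l11
    l22 = math.sqrt(cov[1][1] - l21 ** 2)
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    return [mean[0] + l11 * z1, mean[1] + l21 * z1 + l22 * z2]

def gibbs(x, y, phi, V_inv, alpha, delta, M, rng):
    n = len(y)
    # X'X and X'y for the model y = b0 + b1*x + error
    xtx = [[n, sum(x)], [sum(x), sum(v * v for v in x)]]
    xty = [sum(y), sum(a * b for a, b in zip(x, y))]
    V_inv_phi = [V_inv[r][0] * phi[0] + V_inv[r][1] * phi[1] for r in range(2)]
    # starting value tau_0 = 1/s^2 from the least squares fit
    g = inv2(xtx)
    bh = [g[r][0] * xty[0] + g[r][1] * xty[1] for r in range(2)]
    sse = sum((yi - bh[0] - bh[1] * xi) ** 2 for xi, yi in zip(x, y))
    tau = (n - 2) / sse
    draws = []
    for _ in range(M):
        V_star = inv2([[xtx[r][c] + V_inv[r][c] / tau for c in range(2)] for r in range(2)])
        rhs = [xty[r] + V_inv_phi[r] / tau for r in range(2)]
        phi_star = [V_star[r][0] * rhs[0] + V_star[r][1] * rhs[1] for r in range(2)]
        cov = [[V_star[r][c] / tau for c in range(2)] for r in range(2)]
        beta = draw_mvn2(phi_star, cov, rng)
        resid_ss = sum((yi - beta[0] - beta[1] * xi) ** 2 for xi, yi in zip(x, y))
        # random.gammavariate takes a scale parameter, so pass 1/rate
        tau = rng.gammavariate(n / 2 + alpha, 1.0 / (resid_ss / 2 + delta))
        draws.append((beta[0], beta[1], 1.0 / tau))
    return draws

rng = random.Random(0)
x = [i / 5 for i in range(50)]                       # made-up predictor values
y = [1.0 + 2.0 * xi + rng.gauss(0, 0.5) for xi in x]  # simulated responses
draws = gibbs(x, y, phi=[0.0, 0.0], V_inv=[[0.01, 0.0], [0.0, 0.01]],
              alpha=0.01, delta=0.01, M=2000, rng=rng)
kept = draws[200:]                                    # illustrative burn-in choice
post_mean_slope = sum(d[1] for d in kept) / len(kept)
post_mean_sigma2 = sum(d[2] for d in kept) / len(kept)
print(post_mean_slope, post_mean_sigma2)
```

Posterior summaries such as the estimates and interval limits reported in 11.11 would then be computed from the retained draws, for example as means and empirical quantiles.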


11.11 (a) Bayesian estimates of $\beta_1$, $\beta_2$, and $\beta_3$ are 0.7820, 0.5007, -16.6443.
Lower 95% confidence limits are 0.6281, 0.2627, -42.2511. Upper
95% confidence limits are 0.9358, 0.7386, 8.9625.

(b) Answers will vary. We obtained Bayesian estimates of 0.7817, 0.4990,
-16.5490, lower 95% confidence limits of 0.6332, 0.2627, -42.6158,
and upper 95% confidence limits of 0.9358, 0.7358, 9.5144.

(c) Answers will vary. We obtained a Bayesian prediction of 18.9113, with
lower and upper 95% limits of 2.3505 and 35.7526.

(d) Answers will vary. We obtained Bayesian estimates of 0.8170, 0.4399,
-6.1753, lower 95% confidence limits of 0.6831, 0.2142, -21.4675,
and upper 95% confidence limits of 0.9523, 0.6567, 9.7605.


11.12 Use Problem 11.2 with t=7, a=(a,,+k+4+2)/2, 


b = ((B— ,)'V.'(B— 6.) + (0 — xB)” + Sex]/2. 


Chapter 12 


12.1 _ Mut By 4 Hal + Bon Mit Big + Mar + Moe 


By, + My, 


2 2 2. 
_ 2(Hu + fy + Mo, + 22) = o5 


and 


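12.2 The deficiency in the rank of $\mathbf{X}$ does not affect the differentiation of $\hat{\boldsymbol\varepsilon}'\hat{\boldsymbol\varepsilon}$ in
(12.10). Thus

\[
\frac{\partial\hat{\boldsymbol\varepsilon}'\hat{\boldsymbol\varepsilon}}{\partial\hat{\boldsymbol\beta}} = \mathbf{0} - 2\mathbf{X}'\mathbf{y} + 2\mathbf{X}'\mathbf{X}\hat{\boldsymbol\beta} = \mathbf{0},
\]

which yields (12.11).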


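12.3 For Theorem 2.7a, the coefficient matrix is $\mathbf{A} = \mathbf{X}'\mathbf{X}$, and the augmented
matrix is $\mathbf{B} = (\mathbf{X}'\mathbf{X}, \mathbf{X}'\mathbf{y})$. We can write $\mathbf{B}$ as $\mathbf{X}'(\mathbf{X}, \mathbf{y})$, which leads to
rank($\mathbf{B}$) $\le$ rank($\mathbf{X}'$) = rank($\mathbf{A}$). On the other hand, rank($\mathbf{B}$) $\ge$ rank($\mathbf{A}$)
because augmenting a matrix by a column vector cannot decrease
the column rank. Hence rank($\mathbf{B}$) = rank($\mathbf{A}$); that is, rank($\mathbf{X}'\mathbf{X}, \mathbf{X}'\mathbf{y}$) =
rank($\mathbf{X}'\mathbf{X}$), and the system is consistent.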


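12.4 (a) We can obtain $\boldsymbol\lambda'(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{X} = \boldsymbol\lambda'$ from the expression $\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{X} = \mathbf{X}$
given in Theorem 2.8c(iii). Since $\boldsymbol\lambda' = \mathbf{a}'\mathbf{X}$, multiplying by $\mathbf{a}'$ gives the
result; that is, $\mathbf{a}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{X} = \mathbf{a}'\mathbf{X}$ implies $\boldsymbol\lambda'(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{X} = \boldsymbol\lambda'$.

(b) The condition $\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\boldsymbol\lambda = \boldsymbol\lambda$ follows from Theorem 2.8f, which
states that $\mathbf{A}\mathbf{x} = \mathbf{c}$ has a solution if and only if $\mathbf{A}\mathbf{A}^{-}\mathbf{c} = \mathbf{c}$ for any
generalized inverse of $\mathbf{A}$. Thus, $\mathbf{X}'\mathbf{X}\mathbf{r} = \boldsymbol\lambda$ has a solution if and only if
$\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\boldsymbol\lambda = \boldsymbol\lambda$.

12.5 (a) $\mathbf{a}'\mathbf{X} = (0, 0, 0, 1, 0, 0)\mathbf{X} = (1, 0, 1)$. $\mathbf{X}'\mathbf{X}\mathbf{r} = \boldsymbol\lambda$, where $\mathbf{r} = (0, 0, \tfrac{1}{3})'$.
Show that $\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\boldsymbol\lambda = (1, 0, 1)'$. These values of $\mathbf{a}$ and $\mathbf{r}$ are
illustrative. Many others are possible.

(b) We attempt to find a vector $\mathbf{a}$ such that $\mathbf{a}'\mathbf{X} = \boldsymbol\lambda' = (0, 1, 1)$. Since $\mathbf{X}$ has
only two distinct rows, $\mathbf{a}'\mathbf{X}$ is of the form $a_1(1, 1, 0) +
a_2(1, 0, 1) = (a_1 + a_2, a_1, a_2)$, which must equal (0, 1, 1). This gives
$a_1 + a_2 = 0$, $a_1 = 1$, and $a_2 = 1$, which is clearly impossible. By
Theorem 2.8f, the system of equations $\mathbf{X}'\mathbf{X}\mathbf{r} = \boldsymbol\lambda$ has a solution if and
only if $\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\boldsymbol\lambda = \boldsymbol\lambda$. This is also condition (iii) of Theorem
12.2b. We find that

\[
\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\boldsymbol\lambda = \begin{pmatrix} 0 & 1 & 1 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}\begin{pmatrix} 0 \\ 1 \\ 1 \end{pmatrix} = \begin{pmatrix} 2 \\ 1 \\ 1 \end{pmatrix},
\]

which is not equal to $\boldsymbol\lambda$.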
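The estimability condition used in 12.4(b) and 12.5 can be checked with exact arithmetic. The sketch below uses the 3 x 3 matrix $\mathbf{X}'\mathbf{X}$ that appears in 12.8 (two groups of three observations) and one particular generalized inverse, obtained by inverting the lower-right 2 x 2 block and padding with zeros. That choice of generalized inverse and the helper names are illustrative assumptions.

```python
from fractions import Fraction as F

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

XtX = [[6, 3, 3], [3, 3, 0], [3, 0, 3]]
# one generalized inverse of X'X: invert the lower-right 2x2 block, zeros elsewhere
G = [[F(0), F(0), F(0)], [F(0), F(1, 3), F(0)], [F(0), F(0), F(1, 3)]]

def is_estimable(lam):
    """Check the condition X'X (X'X)^- lambda = lambda."""
    return matvec(matmul(XtX, G), lam) == list(lam)

check_a = is_estimable([1, 0, 1])             # lambda from 12.5(a)
check_b = is_estimable([0, 1, 1])             # lambda from 12.5(b)
image_b = matvec(matmul(XtX, G), [0, 1, 1])   # the vector compared with lambda in 12.5(b)
print(check_a, check_b, image_b)
```

Using `Fraction` avoids any floating-point ambiguity in the equality test.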


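12.6 Multiply the two sets of normal equations by $\mathbf{r}'$, where $\mathbf{r}'\mathbf{X}'\mathbf{X} = \boldsymbol\lambda'$:

\[
\mathbf{r}'\mathbf{X}'\mathbf{X}\hat{\boldsymbol\beta}_1 = \mathbf{r}'\mathbf{X}'\mathbf{y}, \qquad
\mathbf{r}'\mathbf{X}'\mathbf{X}\hat{\boldsymbol\beta}_2 = \mathbf{r}'\mathbf{X}'\mathbf{y}.
\]

Since the right sides are equal, we obtain $\mathbf{r}'\mathbf{X}'\mathbf{X}\hat{\boldsymbol\beta}_1 = \mathbf{r}'\mathbf{X}'\mathbf{X}\hat{\boldsymbol\beta}_2$, or
$\boldsymbol\lambda'\hat{\boldsymbol\beta}_1 = \boldsymbol\lambda'\hat{\boldsymbol\beta}_2$.

12.7 In the answer to Problem 12.5(a), a solution to $\mathbf{X}'\mathbf{X}\mathbf{r} = \boldsymbol\lambda$ is given as
$\mathbf{r} = (0, 0, \tfrac{1}{3})'$. Thus

\[
\mathbf{r}'\mathbf{X}'\mathbf{y} = (0, 0, \tfrac{1}{3})\begin{pmatrix} y_{..} \\ y_{1.} \\ y_{2.} \end{pmatrix} = \frac{y_{2.}}{3} = \bar{y}_{2.}.
\]

For $\boldsymbol\lambda'\hat{\boldsymbol\beta}$, we use

\[
\hat{\boldsymbol\beta} = \begin{pmatrix} \hat\mu \\ \bar{y}_{1.} - \hat\mu \\ \bar{y}_{2.} - \hat\mu \end{pmatrix}
\]

from Example 12.3.1. Then

\[
\boldsymbol\lambda'\hat{\boldsymbol\beta} = (1, 0, 1)\begin{pmatrix} \hat\mu \\ \bar{y}_{1.} - \hat\mu \\ \bar{y}_{2.} - \hat\mu \end{pmatrix} = \hat\mu + \bar{y}_{2.} - \hat\mu = \bar{y}_{2.}.
\]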
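12.8 (a) $\mathbf{X}'\mathbf{X}\mathbf{r} = \boldsymbol\lambda$ is given by

\[
\begin{pmatrix} 6 & 3 & 3 \\ 3 & 3 & 0 \\ 3 & 0 & 3 \end{pmatrix}\begin{pmatrix} r_1 \\ r_2 \\ r_3 \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \\ 0 \end{pmatrix}, \quad \text{or}
\]
\[
\begin{aligned}
6r_1 + 3r_2 + 3r_3 &= 1 \\
3r_1 + 3r_2 &= 1 \\
3r_1 + 3r_3 &= 0.
\end{aligned}
\]

Using the last two equations, we obtain

\[
r_1 = -r_3, \qquad r_2 = r_3 + \tfrac{1}{3},
\]

or

\[
\mathbf{r} = \begin{pmatrix} r_1 \\ r_2 \\ r_3 \end{pmatrix} = \begin{pmatrix} -r_3 \\ r_3 + \tfrac{1}{3} \\ r_3 \end{pmatrix} = r_3\begin{pmatrix} -1 \\ 1 \\ 1 \end{pmatrix} + \begin{pmatrix} 0 \\ \tfrac{1}{3} \\ 0 \end{pmatrix},
\]

where $r_3$ is an arbitrary constant that we can denote by c.

(b) The BLUE, $\mathbf{r}'\mathbf{X}'\mathbf{y}$, is given by

\[
\mathbf{r}'\mathbf{X}'\mathbf{y} = (-c, c + \tfrac{1}{3}, c)\begin{pmatrix} y_{..} \\ y_{1.} \\ y_{2.} \end{pmatrix}
= -cy_{..} + cy_{1.} + \tfrac{1}{3}y_{1.} + cy_{2.}
= -c(y_{1.} + y_{2.}) + cy_{1.} + \bar{y}_{1.} + cy_{2.} = \bar{y}_{1.}.
\]

Show that $y_{..} = y_{1.} + y_{2.}$.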


12.9 (a) From Example 12.2.2(b), we have 


yw 
a 
B=| qm ], +a, + B, = (1, 1,0,1,0)8 = a). 
By 
By 
Show that 
422 2 2 y., 
22 011 Yq. 
XX=/2 02114, Wy=] yr. 
2112 0 V4 
21102 yo 


The value r’ = (0, 5s 0, is — 5) gives r’X’X = A}. Then 


597 


For A}B = B, — B = (0,0,0,1,—1)B, a convenient value for r is 
r’ = (0,0,0, 5, —4), which gives r’X'y =5y.1 —5y2 =Y1 —Y. The 
function A,B = a — a2 = (0,1,—1,0,0)B can be obtained using 


r = (0,4, —4,0,0), which leads to r’X’y = $91. —4y2, =i, — Yo, 




b ryt pf. , VA 2 
(b) E(r,X’y) E(+4 2) 


= 4£201 + yi2) + O11 + 21) — O12 + Yn) 
= 4EByn +yi2 + yer — y22) 
= {B+ a1 +B) + e+ ar 

+ By + + a2 + By — (H+ a + Bo)] 
=4(4ut 4a; +4B,) = w+ai +B, =A‘B. 


12.10 (a) The function A’B = (0,c1,02, ...,c) B= aay c;T; is estimable if there 
exists a vector a such that A’ = a’X. The k distinct rows of X are of the 
form x} = (1,0, ...,0,1,0, ...,0), so that 


L 


k 

/ / / 

A =aX= ) aiX; = (5 ener en nat) 
i=l 


Equating this to N = (0,c1, 00, .--5k), we obtain 
ya; = 0, aj = c; i= 1,2, ...,k. Thus >, cj = 0. 
(b) Any estimable function can be found as a/XfB, which gives 


k n 
a'XB =a'E(y) = S_S- ajE() 


i=l j=l 


=> >Ca(ut n= >>| (t+ ya 
ij i j 
= Sout Taj. = wo ai + So aini 


= woot cit 
i i 


where cj = dj. = Gi: Thus 5°;ci7; is estimable if and only if 


ci = 0. 


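12.11 In Example 12.2.2(a), part (ii), we have

\[
\mathbf{X}'\mathbf{X} = \begin{pmatrix} 6 & 3 & 3 \\ 3 & 3 & 0 \\ 3 & 0 & 3 \end{pmatrix}.
\]

Then for $\boldsymbol\lambda = (0, 1, -1)'$, $\mathbf{X}'\mathbf{X}\mathbf{r} = \boldsymbol\lambda$ becomes

\[
\begin{aligned}
6r_1 + 3r_2 + 3r_3 &= 0 \\
3r_1 + 3r_2 &= 1 \\
3r_1 + 3r_3 &= -1.
\end{aligned}
\]

Show that all solutions are given by

\[
\mathbf{r} = c\begin{pmatrix} 1 \\ -1 \\ -1 \end{pmatrix} + \frac{1}{3}\begin{pmatrix} 0 \\ 1 \\ -1 \end{pmatrix},
\]

where c is arbitrary. Show that $\mathbf{r}'\mathbf{X}'\mathbf{y} = \bar{y}_{1.} - \bar{y}_{2.}$ for all values of c.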
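The two claims in 12.11 (every member of the family solves the system, and the resulting estimator does not depend on c) can be confirmed with exact fractions. The group totals below are made up for illustration; with three observations per group they correspond to group means 4 and 7.

```python
from fractions import Fraction as F

XtX = [[6, 3, 3], [3, 3, 0], [3, 0, 3]]
lam = [0, 1, -1]

def r_of(c):
    """General solution r = c(1,-1,-1)' + (1/3)(0,1,-1)'."""
    return [c, -c + F(1, 3), -c - F(1, 3)]

def xtx_times(r):
    return [sum(a * b for a, b in zip(row, r)) for row in XtX]

# made-up group totals y_1. and y_2. (three observations per group)
y1_tot, y2_tot = F(12), F(21)
Xty = [y1_tot + y2_tot, y1_tot, y2_tot]   # X'y = (y.., y_1., y_2.)'

results = []
for c in [F(0), F(1), F(-5, 2)]:
    r = r_of(c)
    solves_system = xtx_times(r) == lam
    estimate = sum(a * b for a, b in zip(r, Xty))   # r'X'y
    results.append((solves_system, estimate))
print(results)
```

Each estimate equals the difference of the two group means, here 4 - 7 = -3, whatever value of c is used.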


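12.12 Use Corollary 2 to Theorem 3.6d(ii) to obtain $\mathrm{cov}(\mathbf{r}_1'\mathbf{X}'\mathbf{y}, \mathbf{r}_2'\mathbf{X}'\mathbf{y}) =
\mathbf{r}_1'\mathbf{X}'\mathrm{cov}(\mathbf{y})\mathbf{X}\mathbf{r}_2$ and $\mathrm{cov}(\boldsymbol\lambda_1'\hat{\boldsymbol\beta}, \boldsymbol\lambda_2'\hat{\boldsymbol\beta}) = \boldsymbol\lambda_1'\mathrm{cov}(\hat{\boldsymbol\beta})\boldsymbol\lambda_2$.

12.13 (a) $(\mathbf{y} - \mathbf{X}\hat{\boldsymbol\beta})'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol\beta}) = \mathbf{y}'\mathbf{y} - \mathbf{y}'\mathbf{X}\hat{\boldsymbol\beta} - \hat{\boldsymbol\beta}'\mathbf{X}'\mathbf{y} + \hat{\boldsymbol\beta}'\mathbf{X}'\mathbf{X}\hat{\boldsymbol\beta}$. Since $\mathbf{y}'\mathbf{X}\hat{\boldsymbol\beta}$ is a
scalar, it is equal to its transpose $\hat{\boldsymbol\beta}'\mathbf{X}'\mathbf{y}$. The last term, $\hat{\boldsymbol\beta}'\mathbf{X}'\mathbf{X}\hat{\boldsymbol\beta}$,
becomes $\hat{\boldsymbol\beta}'\mathbf{X}'\mathbf{y}$ because $\mathbf{X}'\mathbf{X}\hat{\boldsymbol\beta} = \mathbf{X}'\mathbf{y}$.
(b) Using $\hat{\boldsymbol\beta} = (\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{y}$, we have

\[
\mathbf{y}'\mathbf{y} - \hat{\boldsymbol\beta}'\mathbf{X}'\mathbf{y} = \mathbf{y}'\mathbf{y} - \mathbf{y}'\mathbf{X}[(\mathbf{X}'\mathbf{X})^{-}]'\mathbf{X}'\mathbf{y} = \mathbf{y}'\mathbf{y} - \mathbf{y}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{y}
\]

by Theorem 2.8c(ii).

12.14 $\boldsymbol\beta'\mathbf{X}'[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}']\mathbf{X}\boldsymbol\beta = \boldsymbol\beta'\mathbf{X}'\mathbf{X}\boldsymbol\beta - \boldsymbol\beta'\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{X}\boldsymbol\beta$.
By (2.58), $\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{X} = \mathbf{X}'\mathbf{X}$.

12.15 Follow the steps in the answer to Problem 7.21. Is there any step that must be
altered because $\mathbf{X}$ is not full-rank?

12.16 (a) Since $\hat{\boldsymbol\beta} = (\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{y}$ is a linear function of $\mathbf{y}$ for a particular choice of
$(\mathbf{X}'\mathbf{X})^{-}$, we can use Theorem 4.4a(ii) directly.
(b) Show that $\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'$ is idempotent. Then use Corollary 2 to
Theorem 5.5.
(c) Show that $(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'] = \mathbf{O}$, and then invoke Corollary 1
to Theorem 5.6a.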


12.17 Since v= UB and XB = Zy, we have A’ B = a'XB = a'Zy = b'y. Thus 
NB= b'y= b'y. Similarly, with 6=VB and XB=W6, we have 
A B= a'XB = a'W6 cS and NB = b= el. 




12.18 24 
oe aed ae oe 
XU’ = eG ‘ 

1 2 

1 0 
it 0 1 1 0 

12.19 = = 

A Nid, Ate Ae (; 1 5) 

lees 


12.20 The normal equations are given by 


4 + 27; TT 27 =:Y3. 
26 +27 =y1. 
2p + 27 = yo, 


Substituting B in (12.39) into the first of these, for example, gives 


Ay 20 V2 Ve) SIs 


4 
= + 2(2: *) 42(2 *:) = 


4 2 4 2 4 
y.ty+y.-y. =y. 
Vc y. 


12.21 $a_1 - a_2 = 0$ gives $a_1 = a_2$. Substituting this into $a_1 + a_2 - 2a_3 = 0$ gives
$2a_2 - 2a_3 = 0$ or $a_2 = a_3$.


12.22 Express SSH as a quadratic form in y by substituting B = (X’X) X’y. Show 
that SSH is independent of SSE in (12.21) by use of Corollary 1 to Theorem 
5.6b. Use either C(X’X) X’X = C or X’'X(X'X) X’ = X’. 

12.23 The first normal equation, for example, is $6\hat\mu + 2\hat\alpha_1 + 2\hat\alpha_2 + 2\hat\alpha_3 + 3\hat\beta_1 +
3\hat\beta_2 = y_{..}$, which simplifies to $6\hat\mu = y_{..}$ when we use the two side
conditions.




12.24 

1 1 0 
ree 633) 

M=1, 9 1 |) MX=[3 3 0), B= B, |, 
101 2 O08 By 
1 0 1 
y.. 

XSy= | 
y2 


Then X5X>B) = X4y gives the result in (12.51). 


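12.25 (a)

\[ \mathbf{X} = \begin{pmatrix} 1&1&0&0 \\ 1&1&0&0 \\ 1&1&0&0 \\ 1&0&1&0 \\ 1&0&1&0 \\ 1&0&1&0 \\ 1&0&0&1 \\ 1&0&0&1 \\ 1&0&0&1 \end{pmatrix}, \quad \mathbf{X}'\mathbf{X} = \begin{pmatrix} 9&3&3&3 \\ 3&3&0&0 \\ 3&0&3&0 \\ 3&0&0&3 \end{pmatrix}, \quad \mathbf{X}'\mathbf{y} = \begin{pmatrix} y_{..} \\ y_{1.} \\ y_{2.} \\ y_{3.} \end{pmatrix}. \]

The normal equations are given by

\[ \begin{pmatrix} 9&3&3&3 \\ 3&3&0&0 \\ 3&0&3&0 \\ 3&0&0&3 \end{pmatrix} \begin{pmatrix} \hat{\mu} \\ \hat{\tau}_1 \\ \hat{\tau}_2 \\ \hat{\tau}_3 \end{pmatrix} = \begin{pmatrix} y_{..} \\ y_{1.} \\ y_{2.} \\ y_{3.} \end{pmatrix}, \]

or

\[ \begin{aligned} 9\hat{\mu} + 3\hat{\tau}_1 + 3\hat{\tau}_2 + 3\hat{\tau}_3 &= y_{..}, \\ 3\hat{\mu} + 3\hat{\tau}_i &= y_{i.}, \quad i = 1, 2, 3. \end{aligned} \]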




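(b) Three possible sets of linearly independent estimable functions are

\[ \{\mu + \tau_1,\ \mu + \tau_2,\ \mu + \tau_3\}, \]
\[ \{3\mu + \tau_1 + \tau_2 + \tau_3,\ \tau_1 - \tau_2,\ \tau_1 - \tau_3\}, \]
\[ \{\mu + \tau_1,\ \tau_1 - \tau_2,\ \tau_1 - \tau_3\}. \]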


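(c) The side condition $\hat{\tau}_1 + \hat{\tau}_2 + \hat{\tau}_3 = 0$ gives $\hat{\mu} = \bar{y}_{..}$ and $\hat{\tau}_i = \bar{y}_{i.} - \bar{y}_{..}$, $i = 1, 2, 3$.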


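(d) The hypothesis $H_0: \tau_1 = \tau_2 = \tau_3$ is equivalent to $H_0: \tau_1 - \tau_2 = 0$ and $\tau_1 - \tau_3 = 0$; hence $H_0$ is testable:

\[ \text{SS}(\mu, \tau) = \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} = \hat{\mu}y_{..} + \sum_{i=1}^{3}\hat{\tau}_i y_{i.}. \]

The reduced model is $y_{ij} = \mu + \varepsilon_{ij}$, the $\mathbf{X}$ matrix reduces to a single column of 1's, and the normal equations become $9\hat{\mu} = y_{..}$. Hence

\[ \hat{\mu} = \frac{y_{..}}{9} = \bar{y}_{..}, \qquad \text{SS}(\mu) = \hat{\mu}y_{..} = \frac{y_{..}^2}{9}. \]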
(e)
Analysis of Variance for $H_0: \tau_1 = \tau_2 = \tau_3$

Sum of Squares                                              df   F Statistic
SS(\tau|\mu) = \sum_i y_{i.}^2/3 - y_{..}^2/9                2   [SS(\tau|\mu)/2] / [SSE/6]
SSE = \sum_{ij} y_{ij}^2 - \sum_i y_{i.}^2/3                 6
SST = \sum_{ij} y_{ij}^2 - y_{..}^2/9                        8
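
The point of 12.25, that different solutions of the normal equations give the same estimable functions and the same SS(μ, τ), can be illustrated with a short sketch. The data values below are hypothetical, and the helper names are introduced here only for illustration.

```python
# Hypothetical data: 3 treatments, 3 observations each (not from the text).
y = [[4.0, 5.0, 6.0], [7.0, 9.0, 8.0], [1.0, 2.0, 3.0]]

totals = [sum(row) for row in y]      # y_i.
grand = sum(totals)                   # y_..
means = [t / 3 for t in totals]       # ybar_i.
gmean = grand / 9                     # ybar_..

def satisfies_normal_equations(mu, tau):
    """Check 9*mu + 3*sum(tau) = y_.. and 3*mu + 3*tau_i = y_i. for i = 1, 2, 3."""
    first = abs(9 * mu + 3 * sum(tau) - grand) < 1e-9
    rest = all(abs(3 * mu + 3 * tau[i] - totals[i]) < 1e-9 for i in range(3))
    return first and rest

# Solution 1: side condition tau_1 + tau_2 + tau_3 = 0.
sol_side = (gmean, [m - gmean for m in means])
# Solution 2: mu = 0, tau_i = ybar_i. (another solution of the same equations).
sol_zero = (0.0, list(means))

def ss_model(mu, tau):
    """SS(mu, tau) = mu*y_.. + sum_i tau_i*y_i."""
    return mu * grand + sum(t * yi for t, yi in zip(tau, totals))

def tau1_minus_tau2(mu, tau):
    """An estimable contrast."""
    return tau[0] - tau[1]
```

Both solutions satisfy the normal equations, and `ss_model` and `tau1_minus_tau2` return the same values for each, whereas the individual (non-estimable) values of `mu` and `tau` differ.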




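12.26 (a) The normal equations $\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}'\mathbf{y}$ are given by

\[ \begin{pmatrix} 12&6&6&6&6&3&3&3&3 \\ 6&6&0&3&3&3&3&0&0 \\ 6&0&6&3&3&0&0&3&3 \\ 6&3&3&6&0&3&0&3&0 \\ 6&3&3&0&6&0&3&0&3 \\ 3&3&0&3&0&3&0&0&0 \\ 3&3&0&0&3&0&3&0&0 \\ 3&0&3&3&0&0&0&3&0 \\ 3&0&3&0&3&0&0&0&3 \end{pmatrix} \begin{pmatrix} \hat{\mu} \\ \hat{\alpha}_1 \\ \hat{\alpha}_2 \\ \hat{\beta}_1 \\ \hat{\beta}_2 \\ \hat{\gamma}_{11} \\ \hat{\gamma}_{12} \\ \hat{\gamma}_{21} \\ \hat{\gamma}_{22} \end{pmatrix} = \begin{pmatrix} y_{...} \\ y_{1..} \\ y_{2..} \\ y_{.1.} \\ y_{.2.} \\ y_{11.} \\ y_{12.} \\ y_{21.} \\ y_{22.} \end{pmatrix}, \]

or

\[ \begin{aligned} 12\hat{\mu} + 6\sum_{i=1}^{2}\hat{\alpha}_i + 6\sum_{j=1}^{2}\hat{\beta}_j + 3\sum_{ij}\hat{\gamma}_{ij} &= y_{...}, \\ 6\hat{\mu} + 6\hat{\alpha}_i + 3\sum_{j=1}^{2}\hat{\beta}_j + 3\sum_{j=1}^{2}\hat{\gamma}_{ij} &= y_{i..}, \quad i = 1, 2, \\ 6\hat{\mu} + 3\sum_{i=1}^{2}\hat{\alpha}_i + 6\hat{\beta}_j + 3\sum_{i=1}^{2}\hat{\gamma}_{ij} &= y_{.j.}, \quad j = 1, 2, \\ 3\hat{\mu} + 3\hat{\alpha}_i + 3\hat{\beta}_j + 3\hat{\gamma}_{ij} &= y_{ij.}, \quad i = 1, 2, \ j = 1, 2. \end{aligned} \]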




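(b) The rank of $\mathbf{X}'\mathbf{X}$ is 4. From the last four rows, which are linearly independent, we obtain

\[ \begin{aligned} &\mu + \alpha_1 + \beta_1 + \gamma_{11} \\ &\mu + \alpha_1 + \beta_2 + \gamma_{12} \\ &\mu + \alpha_2 + \beta_1 + \gamma_{21} \\ &\mu + \alpha_2 + \beta_2 + \gamma_{22} \end{aligned} \]

or

\[ \begin{aligned} &\mu + \alpha_1 + \beta_1 + \gamma_{11} \\ &\alpha_1 - \alpha_2 + \gamma_{11} - \gamma_{21} \quad (\text{or } \alpha_1 - \alpha_2 + \gamma_{12} - \gamma_{22}) \\ &\beta_1 - \beta_2 + \gamma_{11} - \gamma_{12} \quad (\text{or } \beta_1 - \beta_2 + \gamma_{21} - \gamma_{22}) \\ &\gamma_{11} - \gamma_{12} - \gamma_{21} + \gamma_{22}. \end{aligned} \]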




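12.27 (a) The normal equations are

\[ \begin{pmatrix} 8&4&4&4&4&4&4 \\ 4&4&0&2&2&2&2 \\ 4&0&4&2&2&2&2 \\ 4&2&2&4&0&2&2 \\ 4&2&2&0&4&2&2 \\ 4&2&2&2&2&4&0 \\ 4&2&2&2&2&0&4 \end{pmatrix} \begin{pmatrix} \hat{\mu} \\ \hat{\alpha}_1 \\ \hat{\alpha}_2 \\ \hat{\beta}_1 \\ \hat{\beta}_2 \\ \hat{\gamma}_1 \\ \hat{\gamma}_2 \end{pmatrix} = \begin{pmatrix} y_{...} \\ y_{1..} \\ y_{2..} \\ y_{.1.} \\ y_{.2.} \\ y_{..1} \\ y_{..2} \end{pmatrix}. \]

(b)
\[ \mu + \alpha_1 + \beta_1 + \gamma_1, \quad \alpha_1 - \alpha_2, \quad \beta_1 - \beta_2, \quad \gamma_1 - \gamma_2. \]

(c) Using the side conditions $\hat{\alpha}_1 + \hat{\alpha}_2 = 0$, $\hat{\beta}_1 + \hat{\beta}_2 = 0$, $\hat{\gamma}_1 + \hat{\gamma}_2 = 0$, we obtain $\hat{\mu} = \bar{y}_{...}$, $\hat{\alpha}_i = \bar{y}_{i..} - \bar{y}_{...}$, $\hat{\beta}_j = \bar{y}_{.j.} - \bar{y}_{...}$, $\hat{\gamma}_k = \bar{y}_{..k} - \bar{y}_{...}$.

(d)
\[ \begin{aligned} \text{SS}(\mu, \alpha, \beta, \gamma) = \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} &= \bar{y}_{...}y_{...} + \sum_i (\bar{y}_{i..} - \bar{y}_{...})y_{i..} + \sum_j (\bar{y}_{.j.} - \bar{y}_{...})y_{.j.} + \sum_k (\bar{y}_{..k} - \bar{y}_{...})y_{..k} \\ &= \frac{y_{...}^2}{8} + \left(\sum_i \frac{y_{i..}^2}{4} - \frac{y_{...}^2}{8}\right) + \left(\sum_j \frac{y_{.j.}^2}{4} - \frac{y_{...}^2}{8}\right) + \left(\sum_k \frac{y_{..k}^2}{4} - \frac{y_{...}^2}{8}\right) \\ &= \text{SS}(\mu) + \text{SS}(\alpha) + \text{SS}(\beta) + \text{SS}(\gamma). \end{aligned} \]

Using this same notation, the reduced model under $H_0: \alpha_1 = \alpha_2$ gives $\text{SS}(\mu, \beta, \gamma) = \text{SS}(\mu) + \text{SS}(\beta) + \text{SS}(\gamma)$.

(e)
Analysis of Variance for $H_0: \alpha_1 = \alpha_2$

Source                      df   Sum of Squares                                                   F
SS(\alpha|\mu,\beta,\gamma)  1   SS(\mu,\alpha,\beta,\gamma) - SS(\mu,\beta,\gamma) = SS(\alpha)   SS(\alpha|\mu,\beta,\gamma) / (SSE/4)
Error                        4   SSE = \sum_{ijk} y_{ijk}^2 - SS(\mu,\alpha,\beta,\gamma)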




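12.28
\[ \mathbf{X}'\mathbf{X} = \begin{pmatrix} 8&4&4&4&4&2&2&2&2 \\ 4&4&0&2&2&2&2&0&0 \\ 4&0&4&2&2&0&0&2&2 \\ 4&2&2&4&0&2&0&2&0 \\ 4&2&2&0&4&0&2&0&2 \\ 2&2&0&2&0&2&0&0&0 \\ 2&2&0&0&2&0&2&0&0 \\ 2&0&2&2&0&0&0&2&0 \\ 2&0&2&0&2&0&0&0&2 \end{pmatrix} \]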
Chapter 13 
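13.1
\[ \mathbf{X}'\mathbf{X} = \begin{pmatrix} kn & n & n & \cdots & n \\ n & n & 0 & \cdots & 0 \\ n & 0 & n & \cdots & 0 \\ \vdots & \vdots & \vdots & & \vdots \\ n & 0 & 0 & \cdots & n \end{pmatrix}, \qquad \mathbf{X}'\mathbf{y} = \begin{pmatrix} \sum_{ij} y_{ij} \\ \mathbf{j}'\mathbf{y}_1 \\ \mathbf{j}'\mathbf{y}_2 \\ \vdots \\ \mathbf{j}'\mathbf{y}_k \end{pmatrix} = \begin{pmatrix} y_{..} \\ y_{1.} \\ y_{2.} \\ \vdots \\ y_{k.} \end{pmatrix}. \]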
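13.2
\[ \hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{y} = \begin{pmatrix} 0 & 0 & \cdots & 0 \\ 0 & 1/n & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 1/n \end{pmatrix} \begin{pmatrix} y_{..} \\ y_{1.} \\ \vdots \\ y_{k.} \end{pmatrix} = \begin{pmatrix} 0 \\ y_{1.}/n \\ \vdots \\ y_{k.}/n \end{pmatrix} = \begin{pmatrix} 0 \\ \bar{y}_{1.} \\ \vdots \\ \bar{y}_{k.} \end{pmatrix}. \]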
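13.3
\[ \begin{aligned} \sum_{i=1}^{k}\sum_{j=1}^{n}(y_{ij} - \bar{y}_{i.})^2 &= \sum_{i=1}^{k}\sum_{j=1}^{n}(y_{ij}^2 - 2y_{ij}\bar{y}_{i.} + \bar{y}_{i.}^2) \\ &= \sum_{ij} y_{ij}^2 - 2\sum_i \bar{y}_{i.}\Big(\sum_j y_{ij}\Big) + n\sum_i \bar{y}_{i.}^2 \\ &= \sum_{ij} y_{ij}^2 - n\sum_i \bar{y}_{i.}^2 = \sum_{ij} y_{ij}^2 - \sum_i \frac{y_{i.}^2}{n}. \end{aligned} \]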




13.4 esac nv (2 2 Gs 0 
aha ve <i 0 I/n 0 0 
0 j 0 
00 : i 3 
} PANO: 20 “Wi + gait A 
j j’ j’ j’ 
j 0’ 0’ 0’ 
’ ! ’ ’ 
{oj 0 0 
0’ 9 0’ j’ 
1j 0 j j’ j j 
0 1j j 0’ 0’ 0’ 
"% 0’ j’ 0’ 0’ 
0 =O 0 J 0” 0 O j 
jj Oo O 
O ij’ O 
= - ; 
oOo... ij 
y [I — X(X’X) X'ly = y'y — y'X(X'X) _X’y 
jj O O\ /¥ 
2 1 / ! ! o ii _ Y2 
oa S yy 7 PAOkE ar 
0 Oo... jj Yi 


k 
= 4-2) yaly, = 94-2 
ij i=1 ij i 


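13.5 (a) With $\alpha_i^* = \mu_i - \bar{\mu}_.$ in (13.5), $H_0: \alpha_1^* = \alpha_2^* = \cdots = \alpha_k^*$ in (13.18) becomes $H_0: \mu_1 - \bar{\mu}_. = \mu_2 - \bar{\mu}_. = \cdots = \mu_k - \bar{\mu}_.$ or $H_0: \mu_1 = \mu_2 = \cdots = \mu_k$, which is equivalent to (13.7).
(b) Denote by $\alpha^*$ the common value of $\alpha_i^*$ in $H_0: \alpha_1^* = \alpha_2^* = \cdots = \alpha_k^*$ in (13.18). Then $\sum_i \alpha_i^* = 0$ gives $\alpha^* = 0$, since $\sum_i \alpha_i^* = k\alpha^* = 0$. Thus, $\alpha_i^* = 0$, $i = 1, 2, \ldots, k$.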




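13.6
\[ \begin{aligned} n\sum_{i=1}^{k}(\bar{y}_{i.} - \bar{y}_{..})^2 &= n\sum_{i=1}^{k}(\bar{y}_{i.}^2 - 2\bar{y}_{i.}\bar{y}_{..} + \bar{y}_{..}^2) \\ &= n\sum_i \bar{y}_{i.}^2 - 2n\bar{y}_{..}\sum_i \bar{y}_{i.} + kn\bar{y}_{..}^2 \\ &= n\sum_i \bar{y}_{i.}^2 - 2kn\bar{y}_{..}^2 + kn\bar{y}_{..}^2 \\ &= \sum_i \frac{y_{i.}^2}{n} - \frac{y_{..}^2}{kn}. \end{aligned} \]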
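
The computational forms in 13.3 and 13.6 can be compared against the defining sums of squares. This is a small sketch on hypothetical balanced data (k = 3 groups, n = 4 observations each); the values are made up for illustration.

```python
# Hypothetical balanced one-way data (not from the text): k = 3, n = 4.
y = [[3.0, 5.0, 4.0, 8.0], [6.0, 9.0, 7.0, 10.0], [2.0, 1.0, 4.0, 5.0]]
k, n = len(y), len(y[0])

totals = [sum(row) for row in y]        # y_i.
grand = sum(totals)                     # y_..
means = [t / n for t in totals]         # ybar_i.
gmean = grand / (k * n)                 # ybar_..

# Within-group sum of squares: definition vs. computational form (13.3).
within_def = sum((v - means[i]) ** 2 for i, row in enumerate(y) for v in row)
within_short = sum(v * v for row in y for v in row) - sum(t * t for t in totals) / n

# Between-group sum of squares: definition vs. computational form (13.6).
between_def = n * sum((m - gmean) ** 2 for m in means)
between_short = sum(t * t for t in totals) / n - grand ** 2 / (k * n)
```

The two versions of each quantity agree up to floating-point rounding.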


13.7 See the first part of the answer to Problem 13.3. 
13.9 Using X in (13.6), we have 


0 0 0 0 0 
Oo 1 -il 0 0 0 1 0 0 0 
C(X’xX) xX’ =] 0 1 0 -!il 0 0 0 1 0 0 
0 1 0 0 -!il 0 0 0 1 +0 
0 0 0 0 1 
j, in, 0 0 0)\' 

x1 j, 90 j, 0 0 

j, 9 0 0 j, 
I. Ae ded 


i, i, of ol 


jn 0 0’ -jj, 




13.10 By (2.37), we obtain 


Je dae “de 1 
. —-j, 0 0 
Aj, — . 1 
: 0 -j, 0 1 
0 0 -j, 
jn jn In 
0 0 jn 
3jn 
ot ce al ee cee 
AJ3A’ = Aj,j,A’ = 4 Biv heal) 
“i 


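13.11 (a)
\[ E(\varepsilon_{ij}^2) = E(\varepsilon_{ij} - 0)^2 = E[\varepsilon_{ij} - E(\varepsilon_{ij})]^2 = \operatorname{var}(\varepsilon_{ij}) = \sigma^2, \]
\[ E(\varepsilon_{ij}\varepsilon_{i'j'}) = E[(\varepsilon_{ij} - 0)(\varepsilon_{i'j'} - 0)] = E\{[\varepsilon_{ij} - E(\varepsilon_{ij})][\varepsilon_{i'j'} - E(\varepsilon_{i'j'})]\} = \operatorname{cov}(\varepsilon_{ij}, \varepsilon_{i'j'}) = 0. \]

(b) Using $\sum_i \alpha_i^* = 0$,

\[ \begin{aligned} E(y_{..}^2) = E\Big(\sum_{ij} y_{ij}\Big)^2 &= E\Big[\sum_{i=1}^{k}\sum_{j=1}^{n}(\mu^* + \alpha_i^* + \varepsilon_{ij})\Big]^2 \\ &= E\Big(kn\mu^* + n\sum_i \alpha_i^* + \sum_{ij}\varepsilon_{ij}\Big)^2 \\ &= E\Big[k^2n^2\mu^{*2} + \Big(\sum_{ij}\varepsilon_{ij}\Big)^2 + 2kn\mu^*\sum_{ij}\varepsilon_{ij}\Big] \\ &= E\Big(k^2n^2\mu^{*2} + \sum_{ij}\varepsilon_{ij}^2 + \sum_{ij \neq lm}\varepsilon_{ij}\varepsilon_{lm} + 2kn\mu^*\sum_{ij}\varepsilon_{ij}\Big) \\ &= k^2n^2\mu^{*2} + kn\sigma^2, \end{aligned} \]

\[ \begin{aligned} E\Big(\sum_i y_{i.}^2\Big) &= E\sum_i\Big(n\mu^* + n\alpha_i^* + \sum_j \varepsilon_{ij}\Big)^2 \\ &= E\sum_i\Big[n^2\mu^{*2} + n^2\alpha_i^{*2} + \Big(\sum_j\varepsilon_{ij}\Big)^2 + 2n^2\mu^*\alpha_i^* + 2n\mu^*\sum_j\varepsilon_{ij} + 2n\alpha_i^*\sum_j\varepsilon_{ij}\Big] \\ &= E\Big[kn^2\mu^{*2} + n^2\sum_i\alpha_i^{*2} + \sum_i\Big(\sum_j\varepsilon_{ij}^2 + \sum_{j \neq l}\varepsilon_{ij}\varepsilon_{il}\Big) + 2n^2\mu^*\sum_i\alpha_i^* + 2n\mu^*\sum_{ij}\varepsilon_{ij} + 2n\sum_i\alpha_i^*\sum_j\varepsilon_{ij}\Big] \\ &= kn^2\mu^{*2} + n^2\sum_i\alpha_i^{*2} + kn\sigma^2, \end{aligned} \]

\[ \begin{aligned} E[\text{SS}(\alpha|\mu)] &= E\Big(\frac{1}{n}\sum_i y_{i.}^2 - \frac{y_{..}^2}{kn}\Big) \\ &= \frac{1}{n}\Big(kn^2\mu^{*2} + n^2\sum_i\alpha_i^{*2} + kn\sigma^2\Big) - \frac{1}{kn}\big(k^2n^2\mu^{*2} + kn\sigma^2\big) \\ &= kn\mu^{*2} + n\sum_i\alpha_i^{*2} + k\sigma^2 - kn\mu^{*2} - \sigma^2 \\ &= (k - 1)\sigma^2 + n\sum_i\alpha_i^{*2}. \end{aligned} \]

(c)
\[ \begin{aligned} E\Big(\sum_{ij} y_{ij}^2\Big) &= E\Big[\sum_{i=1}^{k}\sum_{j=1}^{n}(\mu^* + \alpha_i^* + \varepsilon_{ij})^2\Big] \\ &= E\sum_{ij}\big(\mu^{*2} + \alpha_i^{*2} + \varepsilon_{ij}^2 + 2\mu^*\alpha_i^* + 2\mu^*\varepsilon_{ij} + 2\alpha_i^*\varepsilon_{ij}\big) \\ &= E\Big(kn\mu^{*2} + n\sum_i\alpha_i^{*2} + \sum_{ij}\varepsilon_{ij}^2 + 2n\mu^*\sum_i\alpha_i^* + 2\mu^*\sum_{ij}\varepsilon_{ij} + 2\sum_{ij}\alpha_i^*\varepsilon_{ij}\Big) \\ &= kn\mu^{*2} + n\sum_i\alpha_i^{*2} + kn\sigma^2, \end{aligned} \]

\[ \begin{aligned} E(\text{SSE}) &= E\Big(\sum_{ij} y_{ij}^2 - \frac{1}{n}\sum_i y_{i.}^2\Big) \\ &= kn\mu^{*2} + n\sum_i\alpha_i^{*2} + kn\sigma^2 - kn\mu^{*2} - n\sum_i\alpha_i^{*2} - k\sigma^2 \\ &= k(n - 1)\sigma^2. \end{aligned} \]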
13.13 By (2.37), 
0 0 0 0 0 
| ome Cea | 1 1 1 
Cj,=}] -1 O OFF 1])=]-1]+] Of +] o};F 
0-1 O;\1 0 —1 0 
0 0-1 0 0 = | 
Thus 
0 0 0 
3 3 3 
C'I3 = C'(j3, 53,53) = (C53, Cis, Cj3)= | —1 -1 -1 
=P a =], 
omy es es 
0 0 0 
3 3 31/01-1 0 0 
CiI,C=]-1 -1 -1//01 0-1 0O 
Sheet ea MRO T. re at 
Sal) dt oth 
0 0 0 0 


0 

0 9 —3 —3 —3 
=10-3 1 1 1 

0-3 1 1 1 

0-3 1 1 1 




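13.15 Using $(\mathbf{X}'\mathbf{X})^{-}$ in (13.11) and $\hat{\boldsymbol{\beta}}$ in (13.12), we obtain

\[ \mathbf{c}'\hat{\boldsymbol{\beta}} = (0, c_1, c_2, \ldots, c_k)\begin{pmatrix} 0 \\ \bar{y}_{1.} \\ \bar{y}_{2.} \\ \vdots \\ \bar{y}_{k.} \end{pmatrix} = \sum_{i=1}^{k} c_i\bar{y}_{i.}, \]

\[ \mathbf{c}'(\mathbf{X}'\mathbf{X})^{-}\mathbf{c} = (0, c_1, c_2, \ldots, c_k)\begin{pmatrix} 0 & 0 & \cdots & 0 \\ 0 & 1/n & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 1/n \end{pmatrix}\begin{pmatrix} 0 \\ c_1 \\ \vdots \\ c_k \end{pmatrix} = \sum_{i=1}^{k}\frac{c_i^2}{n}. \]

13.16 Using $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{y}$, the sum of squares for the contrast $\mathbf{c}_i'\hat{\boldsymbol{\beta}}$ can be expressed as

\[ \frac{(\mathbf{c}_i'\hat{\boldsymbol{\beta}})^2}{\mathbf{c}_i'(\mathbf{X}'\mathbf{X})^{-}\mathbf{c}_i} = \frac{\hat{\boldsymbol{\beta}}'\mathbf{c}_i\mathbf{c}_i'\hat{\boldsymbol{\beta}}}{\mathbf{c}_i'(\mathbf{X}'\mathbf{X})^{-}\mathbf{c}_i} = \frac{\mathbf{y}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{c}_i\mathbf{c}_i'(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{y}}{\mathbf{c}_i'(\mathbf{X}'\mathbf{X})^{-}\mathbf{c}_i}, \]

with a similar expression for the sum of squares for $\mathbf{c}_j'\hat{\boldsymbol{\beta}}$. By Corollary 1 to Theorem 5.6b, these two quadratic forms are independent if

\[ \mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{c}_i\mathbf{c}_i'(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{c}_j\mathbf{c}_j'(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}' = \mathbf{O}. \]

This holds if $\mathbf{c}_i'(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{c}_j = 0$, which reduces to $\mathbf{c}_i'(\mathbf{X}'\mathbf{X})^{-}\mathbf{c}_j = 0$, since $\mathbf{c}_i'\boldsymbol{\beta}$ is an estimable function and therefore by Theorem 11.2b(iii), we have $\mathbf{c}_i'(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{X} = \mathbf{c}_i'$. Now by Theorem 12.3c, we obtain

\[ \operatorname{cov}(\mathbf{c}_i'\hat{\boldsymbol{\beta}}, \mathbf{c}_j'\hat{\boldsymbol{\beta}}) = \sigma^2\mathbf{c}_i'(\mathbf{X}'\mathbf{X})^{-}\mathbf{c}_j. \]

13.17
\[ (\mathbf{A}_i)' = (\mathbf{v}_i\mathbf{v}_i')' = (\mathbf{v}_i')'\mathbf{v}_i' = \mathbf{v}_i\mathbf{v}_i' = \mathbf{A}_i. \]

\[ (\mathbf{A}_i)^2 = \mathbf{v}_i\mathbf{v}_i'\mathbf{v}_i\mathbf{v}_i' = \mathbf{v}_i\mathbf{v}_i' = \mathbf{A}_i \quad \text{since } \mathbf{v}_i'\mathbf{v}_i = 1. \]

By Theorem 2.4(iii), $\operatorname{rank}(\mathbf{A}_i) = \operatorname{rank}(\mathbf{v}_i\mathbf{v}_i') = \operatorname{rank}(\mathbf{v}_i) = 1$.
$\mathbf{A}_i\mathbf{A}_j = \mathbf{v}_i\mathbf{v}_i'\mathbf{v}_j\mathbf{v}_j' = \mathbf{O}$ because $\mathbf{v}_i'\mathbf{v}_j = 0$ by Theorem 2.12c(ii).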
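
The four properties of the matrices $\mathbf{A}_i = \mathbf{v}_i\mathbf{v}_i'$ listed above can be confirmed numerically. This sketch assumes NumPy and uses two hypothetical orthonormal vectors built from the contrasts (1, -1, 0) and (1, 1, -2); they are illustrative and not taken from the text.

```python
import numpy as np

# Hypothetical orthogonal vectors, normalized so that v_i'v_i = 1.
V = np.array([[1.0, 1.0],
              [-1.0, 1.0],
              [0.0, -2.0]])
V = V / np.linalg.norm(V, axis=0)        # columns are v_1 and v_2

A = [np.outer(V[:, i], V[:, i]) for i in range(2)]   # A_i = v_i v_i'

symmetric = all(np.allclose(Ai, Ai.T) for Ai in A)
idempotent = all(np.allclose(Ai @ Ai, Ai) for Ai in A)
ranks = [int(np.linalg.matrix_rank(Ai)) for Ai in A]
product_is_zero = np.allclose(A[0] @ A[1], np.zeros((3, 3)))
```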




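13.18 (a)
\[ \left(\frac{\mathbf{J}}{abn}\right)^2 = \frac{\mathbf{j}\mathbf{j}'\mathbf{j}\mathbf{j}'}{(abn)^2} = \frac{abn\,\mathbf{j}\mathbf{j}'}{(abn)^2} = \frac{\mathbf{J}}{abn}. \]
(b) $\dfrac{\mathbf{J}}{abn}\mathbf{x}_1 = \lambda_1\mathbf{x}_1 = \mathbf{x}_1$, since $\lambda_1 = 1$; that is,
\[ \frac{\mathbf{j}\mathbf{j}'\mathbf{x}_1}{abn} = \mathbf{x}_1. \]

Clearly $\mathbf{x}_1 = \mathbf{j}$ is a solution, since $\mathbf{j}'\mathbf{j} = abn$.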


os oe 
13.19 je Mandy AP. i 
TOs 4n An j 
Vov sect Se fot i.) = 7 —0 
ov J/4nV/2n Jno Sno Sno Sn 0 . 
0 
ej ; 
13.20 i! _y JX... X10X2 4 
cde Sa jj ms X} oX1.0 on 
= jx — jx. —0 [by (13.66)], 
§ ; 
x’ xX — x! x aia x’ . xX) 9X2 x! x 
1.0%2.01 1.0%2 jj 1.05 x; 9X10 1.01.0 
= X1 9X2 — 0 — X} 9X [by (13.66)]. 


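13.21 By (7.97), we have $\mathbf{x}_{3\cdot 012} = \mathbf{x}_3 - \mathbf{Z}_1(\mathbf{Z}_1'\mathbf{Z}_1)^{-1}\mathbf{Z}_1'\mathbf{x}_3$, where $\mathbf{Z}_1 = (\mathbf{j}, \mathbf{x}_{1\cdot 0}, \mathbf{x}_{2\cdot 01})$. Thus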


+s -1 / 
. ; ; 
X3.012 = X3 — (j,X1.0,X201)] O xX) 9X10 0 X) 0X3 
I ! 
0 0 X> 91 X2.01 X> 91 X3 
e/ ! ! 
= JX3 . X19 X3 X> 91 X3 
X37 ae 7 Xi. j X2.01- 
JJ X; 9X10 X> 91 X2.01 


13.22 Using (13.66) and (13.69), we have 


7 ; ; 
JX3 7 Xy 9X3 X91%3_ yy 
7. JJ J X10 
JJ 


= jx; — jx; -0-0, 


e/ of 
J X3.012 = J X3 ; 
X) 9X1.0 


“ , 
! ose JX3 0%3 xX 
X}0%3.012 = Xj 0X3 ii Xo“ TZ 5 NL O*LO — =X) 9X2.01- 




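13.23 Show that the coefficients in (13.70) are given by

\[ \frac{\mathbf{j}'\mathbf{x}_3}{\mathbf{j}'\mathbf{j}} = \frac{100n}{4n} = 25, \qquad \frac{\mathbf{x}_{1\cdot 0}'\mathbf{x}_3}{\mathbf{x}_{1\cdot 0}'\mathbf{x}_{1\cdot 0}} = \frac{208n}{20n} = 10.4, \qquad \frac{\mathbf{x}_{2\cdot 01}'\mathbf{x}_3}{\mathbf{x}_{2\cdot 01}'\mathbf{x}_{2\cdot 01}} = \frac{30n}{4n} = 7.5. \]

Then by (13.70),

\[ \mathbf{z}_3 = \mathbf{x}_3 - 25\mathbf{j} - 10.4\mathbf{x}_{1\cdot 0} - 7.5\mathbf{x}_{2\cdot 01} = (-.3, \ldots, -.3, .9, \ldots, .9, -.9, \ldots, -.9, .3, \ldots, .3)', \]

which we divide by .3 to obtain

\[ \mathbf{z}_3 = (-1, \ldots, -1, 3, \ldots, 3, -3, \ldots, -3, 1, \ldots, 1)'. \]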
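
The successive orthogonalization behind these coefficients can be sketched in a few lines. The sketch below assumes four equally spaced levels 1, 2, 3, 4 with a single observation per level (n = 1, which does not change the contrast pattern); the function names are illustrative.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def residual(x, basis):
    """Remove from x its projections on mutually orthogonal basis vectors."""
    r = list(x)
    for z in basis:
        coef = dot(z, x) / dot(z, z)      # e.g. j'x3 / j'j = 25 for the cubic term
        r = [ri - coef * zi for ri, zi in zip(r, z)]
    return r

levels = [1.0, 2.0, 3.0, 4.0]             # assumed equally spaced levels
j = [1.0] * len(levels)
x1 = levels
x2 = [t ** 2 for t in levels]
x3 = [t ** 3 for t in levels]

z1 = residual(x1, [j])                    # proportional to x_{1.0}
z2 = residual(x2, [j, z1])                # x_{2.01}
z3 = residual(x3, [j, z1, z2])            # x_{3.012} before rescaling

# Rescale to integer contrast coefficients.
linear = [round(2 * v) for v in z1]
quadratic = [round(v) for v in z2]
cubic = [round(v / abs(z3[0])) for v in z3]
```

Here `z1` equals x1 - 2.5j, which is half of the vector 2(x1 - 2.5j) used in the text; the rescaling does not affect orthogonality.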
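13.24
\[ \begin{aligned} \mathbf{z}_0 &= \mathbf{x}_0 = \mathbf{j}, \\ \mathbf{z}_1 &= \mathbf{x}_{1\cdot 0} = 2(\mathbf{x}_1 - 2.5\mathbf{j}), \\ \mathbf{z}_2 &= \mathbf{x}_{2\cdot 01} = \mathbf{x}_2 - 7.5\mathbf{j} - 2.5\mathbf{x}_{1\cdot 0} = \mathbf{x}_2 - 7.5\mathbf{j} - 2.5[2(\mathbf{x}_1 - 2.5\mathbf{j})] = \mathbf{x}_2 + 5\mathbf{j} - 5\mathbf{x}_1, \\ \mathbf{z}_3 &= \frac{\mathbf{x}_3 - 25\mathbf{j} - 10.4\mathbf{x}_{1\cdot 0} - 7.5\mathbf{x}_{2\cdot 01}}{.3} = \frac{\mathbf{x}_3 - 25\mathbf{j} - 10.4(2\mathbf{x}_1 - 5\mathbf{j}) - 7.5(\mathbf{x}_2 + 5\mathbf{j} - 5\mathbf{x}_1)}{.3}. \end{aligned} \]

Then $\mathbf{X}\boldsymbol{\beta} = \mathbf{Z}\boldsymbol{\theta}$ can be written as

\[ \begin{aligned} \beta_0\mathbf{j} + \beta_1\mathbf{x}_1 + \beta_2\mathbf{x}_2 + \beta_3\mathbf{x}_3 &= \theta_0\mathbf{j} + \theta_1\mathbf{z}_1 + \theta_2\mathbf{z}_2 + \theta_3\mathbf{z}_3 \\ &= \theta_0\mathbf{j} + \theta_1(2\mathbf{x}_1 - 5\mathbf{j}) + \theta_2(\mathbf{x}_2 + 5\mathbf{j} - 5\mathbf{x}_1) + \theta_3\left[\frac{\mathbf{x}_3}{.3} - 35\mathbf{j} + \frac{16.7}{.3}\mathbf{x}_1 - 25\mathbf{x}_2\right] \\ &= (\theta_0 - 5\theta_1 + 5\theta_2 - 35\theta_3)\mathbf{j} + \left[2\theta_1 - 5\theta_2 + \frac{16.7}{.3}\theta_3\right]\mathbf{x}_1 + (\theta_2 - 25\theta_3)\mathbf{x}_2 + \frac{\theta_3}{.3}\mathbf{x}_3. \end{aligned} \]

Thus

\[ \begin{aligned} \beta_0 &= \theta_0 - 5\theta_1 + 5\theta_2 - 35\theta_3, \\ \beta_1 &= 2\theta_1 - 5\theta_2 + \frac{16.7}{.3}\theta_3, \\ \beta_2 &= \theta_2 - 25\theta_3, \\ \beta_3 &= \frac{\theta_3}{.3}. \end{aligned} \]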
j jy 
Z LY 
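13.25
\[ \mathbf{Z}'\mathbf{y} = (\mathbf{j}, \mathbf{z}_1, \mathbf{z}_2, \mathbf{z}_3)'\mathbf{y} = \begin{pmatrix} \mathbf{j}'\mathbf{y} \\ \mathbf{z}_1'\mathbf{y} \\ \mathbf{z}_2'\mathbf{y} \\ \mathbf{z}_3'\mathbf{y} \end{pmatrix}, \]

\[ \hat{\boldsymbol{\theta}} = (\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{y} = \begin{pmatrix} \mathbf{j}'\mathbf{j} & 0 & 0 & 0 \\ 0 & \mathbf{z}_1'\mathbf{z}_1 & 0 & 0 \\ 0 & 0 & \mathbf{z}_2'\mathbf{z}_2 & 0 \\ 0 & 0 & 0 & \mathbf{z}_3'\mathbf{z}_3 \end{pmatrix}^{-1}\begin{pmatrix} \mathbf{j}'\mathbf{y} \\ \mathbf{z}_1'\mathbf{y} \\ \mathbf{z}_2'\mathbf{y} \\ \mathbf{z}_3'\mathbf{y} \end{pmatrix} = \begin{pmatrix} \mathbf{j}'\mathbf{y}/\mathbf{j}'\mathbf{j} \\ \mathbf{z}_1'\mathbf{y}/\mathbf{z}_1'\mathbf{z}_1 \\ \mathbf{z}_2'\mathbf{y}/\mathbf{z}_2'\mathbf{z}_2 \\ \mathbf{z}_3'\mathbf{y}/\mathbf{z}_3'\mathbf{z}_3 \end{pmatrix}. \]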


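13.26 Since the columns of $\mathbf{Z}$ are linear transformations of the columns of $\mathbf{X}$ [see (13.65), (13.68), and (13.70)], we can write $\mathbf{Z} = \mathbf{X}\mathbf{H}$ and $\mathbf{Z}_1 = \mathbf{X}_1\mathbf{H}_1$, where $\mathbf{H}$ and $\mathbf{H}_1$ are nonsingular. Thus

\[ \begin{aligned} \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} - \hat{\boldsymbol{\beta}}_1'\mathbf{X}_1'\mathbf{y} &= \mathbf{y}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} - \mathbf{y}'\mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{y} \\ &= \mathbf{y}'\mathbf{Z}\mathbf{H}^{-1}[(\mathbf{Z}\mathbf{H}^{-1})'(\mathbf{Z}\mathbf{H}^{-1})]^{-1}(\mathbf{Z}\mathbf{H}^{-1})'\mathbf{y} \\ &\quad - \mathbf{y}'\mathbf{Z}_1\mathbf{H}_1^{-1}[(\mathbf{Z}_1\mathbf{H}_1^{-1})'(\mathbf{Z}_1\mathbf{H}_1^{-1})]^{-1}(\mathbf{Z}_1\mathbf{H}_1^{-1})'\mathbf{y}. \end{aligned} \]

Show that this reduces to $(\mathbf{z}_k'\mathbf{y})^2/\mathbf{z}_k'\mathbf{z}_k$.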


13.27 Linear: —3(1) — (2)+2+3(1) =0 


Quadratic: 1—2—2+1=-2 
Cubic: —1+3(2)—3(2)+1=0 




13.28 The orthogonal contrasts that can be used in $H_0: \sum_i c_i\mu_i = 0$ in part (b) are

\[ \begin{aligned} 2\mu_1 + 2\mu_2 + 2\mu_3 - 3\mu_4 - 3\mu_5 &= 0 \\ 2\mu_1 - \mu_2 - \mu_3 &= 0 \\ \mu_2 - \mu_3 &= 0 \\ \mu_4 - \mu_5 &= 0. \end{aligned} \]

The results for parts (a) and (b) are given in the following ANOVA table.

Source                 df   Sum of Squares   Mean Square   F       p Value
Breed                   4      4,276.1327     1069.0332     8.47   .000033
Contrasts
  A, B, C vs. D, E      1        211.7289      211.7289     1.68   .202
  A, B vs. C            1        370.6669      370.6669     2.94   .0933
  A vs. B               1        708.0500      708.0500     5.61   .0221
  D vs. E               1      2,885.4545     2885.4545    22.86   .0000182
Error                  46      5,806.4556      126.2273
Total                  50     10,082.5882

13.29 The orthogonal polynomial contrast coefficients are the rows of the following matrix (see Table 13.5):

\[ \begin{pmatrix} -2 & -1 & 0 & 1 & 2 \\ 2 & -1 & -2 & -1 & 2 \\ -1 & 2 & 0 & -2 & 1 \\ 1 & -4 & 6 & -4 & 1 \end{pmatrix}. \]

The results for parts (a) and (b) are given in the following ANOVA table.

Source         df   Sum of Squares   Mean Square   F         p Value
Glucose         4      154.9210        38.7303      29.77    7.902 x 10^{-11}
Contrasts
  Linear        1      140.1587       140.1587     107.74    3.168 x 10^{-12}
  Quadratic     1        0.0065         0.0065       0.006   .944
  Cubic         1       14.7319        14.7319      11.32    .002
  Quartic       1        0.0241         0.0241       0.021   .893
Error          35       45.5322         1.3009
Total          39      200.4532

The means for the five glucose concentrations are 2.66, 2.69, 4.94, 7.09, and 7.10. From the F's we see that there is a large linear effect and a small cubic effect.

13.30 The contrast coefficients are given in the following matrix:

\[ \begin{pmatrix} 2 & -1 & -1 \\ 0 & 1 & -1 \end{pmatrix}. \]

The results for parts (a) and (b) are given in the following ANOVA table.

Source         df   Sum of Squares   Mean Square   F         p Value
Stimulus        2      561.5714       280.7857      67.81    2.018 x 10^{-13}
Contrasts
  1 vs. 2, 3    1      525.0000       525.0000     126.78    8.005 x 10^{-14}
  2 vs. 3       1       36.5714        36.5714       8.83    .00505
Error          39      161.5000         4.1410
Total          41      723.0714

13.31 For contrast coefficients comparing the two types of raw materials, we can use those in the vector (5, 5, 5, 5, -4, -4, -4, -4, -4). The results for parts (a) and (b) are in the following ANOVA table.

Source         df   Sum of Squares   Mean Square   F        p Value
Cable           8     1924.2963       240.5370      9.07    2.831 x 10^{-9}
Contrast        1     1543.6463      1543.6463     58.18    1.493 x 10^{-11}
Error          99     2626.9167        26.5345
Total         107     4551.2130

13.32 Contrast coefficients are given in the following matrix:

\[ \begin{pmatrix} 1 & 1 & -1 & -1 \\ 1 & -1 & 0 & 0 \\ 0 & 0 & 1 & -1 \end{pmatrix}. \]

The results for parts (a) and (b) are given in the following ANOVA table.

Source           df   Sum of Squares   Mean Square   F        p Value
Treatments        3     1045.4583       348.4861      6.03    .0043
Contrasts
  1, 2 vs. 3, 4   1        7.0417         7.0417      0.12    .731
  1 vs. 2         1       30.0833        30.0833      0.52    .479
  3 vs. 4         1     1008.3333      1008.3333     17.44    .0005
Error            20     1156.5000        57.8250
Total            23     2201.9583


13.33 Contrast coefficients are given in the following matrix:

\[ \begin{pmatrix} 1 & 1 & 1 & -3 \\ 1 & 1 & -2 & 0 \\ 1 & -1 & 0 & 0 \end{pmatrix}. \]

The results for parts (a) and (b) are given in the following ANOVA table.

Source            df   Sum of Squares   Mean Square   F        p Value
Treatments         3     3462.500        1154.167      6.71    .00103
Contrasts
  1, 2, 3 vs. 4    1     1968.300        1968.300     11.44    .00175
  1, 2 vs. 3       1       66.150          66.150       .385   .539
  1 vs. 2          1     1428.050        1428.050      8.30    .0066
Error             36     6193.400         172.039
Total             39     9655.900
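
In the balanced case, the sums of squares for a complete set of orthogonal contrasts add up to the treatment sum of squares, as the three contrasts in 13.33 do. The sketch below uses the contrast matrix of 13.33 with hypothetical treatment means (the raw data are not reproduced in this section); the function name is illustrative.

```python
# Hypothetical balanced layout: 4 treatment means, n = 10 observations each.
n = 10
means = [52.0, 40.0, 49.0, 61.0]
contrasts = [[1, 1, 1, -3],
             [1, 1, -2, 0],
             [1, -1, 0, 0]]

def contrast_ss(c, means, n):
    """Sum of squares for a contrast: n * (sum c_i * ybar_i)^2 / sum c_i^2."""
    estimate = sum(ci * m for ci, m in zip(c, means))
    return n * estimate ** 2 / sum(ci * ci for ci in c)

# Orthogonality: every pair of coefficient rows has zero inner product.
orthogonal = all(
    sum(a * b for a, b in zip(contrasts[r], contrasts[s])) == 0
    for r in range(3) for s in range(r + 1, 3)
)

grand_mean = sum(means) / len(means)
ss_treatments = n * sum((m - grand_mean) ** 2 for m in means)
ss_parts = [contrast_ss(c, means, n) for c in contrasts]
```

With these made-up means, the three contrast sums of squares are 1470, 60, and 720, which total the treatment sum of squares of 2250.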


Chapter 14 


14.100 0) = wy — My = M+ +B, + 1 — (Ut a + B+ Y1) 
05 = My, — Miz — B31 + M32 


14.2 By Theorem 12.2b, all estimable functions can be obtained from $\mu_{ij} = \mu + \alpha_i + \beta_j + \gamma_{ij}$. To obtain an estimable contrast of the form $\sum_i c_i\alpha_i$, where $\sum_i c_i = 0$, we consider

\[ \sum_{i=1}^{a} c_i\mu_{ij} = \sum_{i=1}^{a} c_i\mu + \sum_i c_i\alpha_i + \sum_i c_i\beta_j + \sum_i c_i\gamma_{ij} = \sum_i c_i\alpha_i + \sum_i c_i\gamma_{ij}. \]

Thus an estimable function of the $\alpha$'s also involves the $\gamma$'s.


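14.3
\[ \begin{aligned} \tfrac{1}{3}(\theta_1 + \theta_2 + \theta_3) &= \tfrac{1}{3}(\beta_1 - \beta_2 + \gamma_{11} - \gamma_{12} + \beta_1 - \beta_2 + \gamma_{21} - \gamma_{22} + \beta_1 - \beta_2 + \gamma_{31} - \gamma_{32}) \\ &= \tfrac{1}{3}(3\beta_1 - 3\beta_2 + \gamma_{11} + \gamma_{21} + \gamma_{31} - \gamma_{12} - \gamma_{22} - \gamma_{32}). \end{aligned} \]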


ee) a = 0G, — B= De an, 
i=1 i i 


= 3 BM; ap. _ Mo KM 
: b ab b b 


©) 5 Yj = Yo (uy — BB +B) 
= Do by — Do i — aij + aft 
i, Aj ap 
= Hey ~ a ab 
= Bb; Fe nyt 
14.5 (a) 


Say — SS (ai -a+¥-Y¥2) 
isl i 
= Yai =a) + YO =) 


=a pee 
oe 2, hot, 4 (nots > Vi. Vi 
ee i ae) ate eee cs a ae een 
i=1 i i 
=y, Y;; Zotie nd: 


b a ab’ 


14.6 (b) Vii = by — Bi. — yt Kb, 


1 b 1 a 1 a b 
j= i= i=l j= 


1 b 
j=l 


1< 1 
=a et a + B+ Wt zd Ut a+ B+ y) 
i=l i 


= B+ a+ Bt Vy HO aw Sar 
i j 
A Da--lRwtetle 
1 1 
+p t apo W 


= VT MOET VE 


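14.7
\[ \begin{aligned} E(\hat{\alpha}_i) = E(\bar{y}_{i..} - \bar{y}_{...}) &= E\left(\frac{\sum_{jk} y_{ijk}}{bn}\right) - E\left(\frac{\sum_{ijk} y_{ijk}}{abn}\right) \\ &= \frac{\sum_{jk} E(y_{ijk})}{bn} - \frac{\sum_{ijk} E(y_{ijk})}{abn} \\ &= \frac{\sum_{jk}(\mu^* + \alpha_i^* + \beta_j^* + \gamma_{ij}^*)}{bn} - \frac{\sum_{ijk}(\mu^* + \alpha_i^* + \beta_j^* + \gamma_{ij}^*)}{abn} \\ &= \frac{bn\mu^* + bn\alpha_i^* + n\sum_j\beta_j^* + n\sum_j\gamma_{ij}^*}{bn} - \frac{abn\mu^* + bn\sum_i\alpha_i^* + an\sum_j\beta_j^* + n\sum_{ij}\gamma_{ij}^*}{abn} \\ &= \mu^* + \alpha_i^* - \mu^* = \alpha_i^*. \end{aligned} \]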


14.8 (a) For b= 2 and n= 2, we have 
. = 1 1 1 
Jig Sige 5 Yuk + 5 Yi2k = 5 Suk 
k k jk 


1 Vijk _ 
2 ( » 4 V1. 




. An Aj 
T 7 af 
14.9 Write X’X as X’/X = ee rT ), Then 


X'X(X'X)- = Ay) Ap\/O O _ (0 5A 
Ay, 21 Oo 51 Oo i! 
X'X(X’X) XX = (og Ai ) ; 
Ad 21 


Show that }Aj2A21 = Ajj; that is, show that 


D:D ION. IO. Dy O10 
220000/{220002 
f [0202.22 0-0))| 2.0. 20210 
21011000 22) | 2 O20: 00S 
2.0: 2502 0.) |, 2-0 BBD: GO 
0: 2 6.2 0-2 52-8030" 2 
124 4 4 6 6 
A? ADOS0. 9 2 
_) | A042 022 
= sede SONG as, 938 
eae ee 
6 2 2 9076 
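14.10
\[ \begin{aligned} \sum_{ijk}(y_{ijk} - \bar{y}_{ij.})^2 &= \sum_{ijk}(y_{ijk}^2 - 2y_{ijk}\bar{y}_{ij.} + \bar{y}_{ij.}^2) \\ &= \sum_{ijk} y_{ijk}^2 - 2\sum_{ij}\bar{y}_{ij.}\sum_k y_{ijk} + n\sum_{ij}\bar{y}_{ij.}^2 \\ &= \sum_{ijk} y_{ijk}^2 - 2\sum_{ij}\bar{y}_{ij.}\,n\bar{y}_{ij.} + n\sum_{ij}\bar{y}_{ij.}^2. \end{aligned} \]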


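14.11 From $\mu_{21} - \mu_{22} = \mu_{31} - \mu_{32}$, we have

\[ \begin{aligned} 0 &= \mu_{21} - \mu_{22} - \mu_{31} + \mu_{32} \\ &= \mu + \alpha_2 + \beta_1 + \gamma_{21} - (\mu + \alpha_2 + \beta_2 + \gamma_{22}) - (\mu + \alpha_3 + \beta_1 + \gamma_{31}) + \mu + \alpha_3 + \beta_2 + \gamma_{32} \\ &= \gamma_{21} - \gamma_{22} - \gamma_{31} + \gamma_{32}. \end{aligned} \]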




14.12 


2 
S Germs 8 zs) a ee . 
Oi. — YD. ; abn me —~ bn  abn& me 


a ae 

bn abn’ 
eee ee ee Vy Ven VG. : 
2, Gu Vi. — Vy TYG. ae iat ah aba 


2 
-pt-y(», 7 x 


14.13 (a) Using B from (14.24) and X’y from (14.19) (both extended to general a 
and b), we obtain 


2 
Ai q Viz Vii 
BXy= 5 Vij ij. = S ( EY yy = y ee 
i eae ron 
) Ey =H) = 0% 
n yj, =n ( ) =n - 
a ‘ae Tue 


14.14 In the following array, we see that the $\gamma_{ij}$'s in the margins can all be obtained from the remaining $(a - 1)(b - 1)$ $\gamma_{ij}$'s by using side conditions:




14.15 By (5.1), 3, Gi — 3)” = Ly? — ny". Then 
nS- (5%. — i. I, +9. 


=n (G5; - 9. - Fy; — IID 
: 


=D) Gy, 5.) tan 9. — FY — 2D Fy, — FIG, - 
ij J y 


=n) yj, — bn ly +an) ¥, — abny’. 
ij i J 
Vij. Yi. 
-»| an 7a pale _ 
Yih y 
= pa Ae "iat 2 oy as bn 
Vij. Ve. Yj. Ya. 
oH » (e =) ( = ay 


i) 2 2, D: 
_yeu a8 REN ages Cee 2 5 Ie 
2 n bn De an abn an » Yi b ” b 


ij i Jj 
2 2 9. 
_Vyru AYE RO des Ne 2 2, 2. 
7 » n bn De an abn an ds 3 abn 


2 2 
= 5 Yi yy Ma 
7” bn yan abn 


14.16 By (5.1), we obtain 


SSE = Sie — Hy) = 22 Om Iu 


ijk 


x2 




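14.17 Partitioning $\mathbf{X}$ into $\mathbf{X} = (\mathbf{X}_1, \mathbf{X}_2)$, where $\mathbf{X}_1$ contains the first six columns and $\mathbf{X}_2$ constitutes the last six columns, we have

\[ \mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}' = (\mathbf{X}_1, \mathbf{X}_2)\begin{pmatrix} \mathbf{O} & \mathbf{O} \\ \mathbf{O} & \tfrac{1}{2}\mathbf{I} \end{pmatrix}\begin{pmatrix} \mathbf{X}_1' \\ \mathbf{X}_2' \end{pmatrix} = \tfrac{1}{2}\mathbf{X}_2\mathbf{X}_2'. \]

We can express $\mathbf{X}_2$ as

\[ \mathbf{X}_2 = \begin{pmatrix} \mathbf{j} & \mathbf{0} & \cdots & \mathbf{0} \\ \mathbf{0} & \mathbf{j} & \cdots & \mathbf{0} \\ \vdots & \vdots & & \vdots \\ \mathbf{0} & \mathbf{0} & \cdots & \mathbf{j} \end{pmatrix}, \]

where $\mathbf{j}$ and $\mathbf{0}$ are $2 \times 1$. Hence $\tfrac{1}{2}\mathbf{X}_2\mathbf{X}_2'$ assumes the form given in (14.50).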


14.18 (a) 


| 


DI RD] Ro | Ro | So} 


xX) Xi(X, Xi) = 


I 


| 


NIENIF = OO = 
DR WRR| RU] RS 
ROWER RRR 


NIENIE OO ee 
NIFNIE OR OF 


| 


Multiply by XX; on the right to show that XX) (X-Xi) XX = XX. 
(b) 
—1 
—l 
—1 
—1 
—l 


= —] 
X\(XjX1) = 4 “1 


—1 
—1 
—1 
—1 
—1 


CoCo CoooCoOWWW WwW 
SCO CO OWWWWHNOOC SC 
WWwWwooncdceeaececcece 
CSCONNOCOONNOON YN 
NNOGOONNTCTONNOCOO 


Multiply on the right side by X,’ to obtain X; (XX) Xj in (14.54). 




14.19 We first consider y., and wis 


ya = SS yik i So yuk + S— yo + So yak 
ik k k k 


=ynityonit ya 


j 

0 

J / J / J / j 

= (Vir Yi2> Voi Y22» ¥31> Y32) 0 |’ 

j 

0 
j JO JO J O 
O Oo O00 0 0 0 
2 / j e/ / sf / ef ! J J O J O J O 

= ,O, ,O,7 ,O = 
Yi=V] o j,0,j,0,j,0)y=y Ee eee 
j JO JO J O 
O Oo 0 0 0 0 0 
Similarly 

O Oo O00 0 0 0 
j Oo JO J O J 
pee | O Joe Ios loot wettout O O Oo (@) O O 
O Oo O00 0 0 0 
j Oo JO J O J 


If we denote the above matrices as C, and C;, we have 


2 
EL, =a +b, = by Cry + gy’ Coy = 2y(Ci + Cody. 


Then 


C=C,+C,= 


ono O40 
“On O40 
eououoag 
+Oou O40 
one O40 
“On Oud 




14.20 We show the result of 5A—4B—-iC+4D for the first two “rows”: 
i JOO 0 0 0 ‘ JIJO0o0O0O oO 
*"\o J 000 0/ *\I J0000 
Oo J O 
JO J 


i a ena 
+5 
P\J IJ III 


since 
6 3 2 a) 
I — 735 a3 +55 iJ, 
3 i ee 
O-J-O+7I=— J, 
2 Ly 
O-O-5I+G5 iJ, 
and so on. 


14.21 If $\alpha_1^* = \alpha_2^* = \alpha_3^* = \alpha^*$, say, then $\sum_i \alpha_i^* = 0$ implies $0 = \sum_i \alpha_i^* = 3\alpha^*$, or $\alpha^* = 0$.


14.22 


a a ob 
SS(u,0, y) = Ay. + 9 aii. + SOS Hy. 
i=1 


i=l j=l 
2 2 2 2 2 2 2 
= Vie Sine a Vij. +i. jet Ma 
=o es] + + ; 
abn ( : bn | (= n : bn Dan Z| 


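14.23 As noted preceding Theorem 14.4b, $\text{SS}(\alpha, \beta, \gamma|\mu) = \text{SS}(\alpha|\mu, \beta, \gamma) + \text{SS}(\beta|\mu, \alpha, \gamma) + \text{SS}(\gamma|\mu, \alpha, \beta)$, where $\text{SS}(\alpha, \beta, \gamma|\mu) = \sum_{ij} y_{ij.}^2/n - y_{...}^2/abn$. For $a = 3$, $b = 2$, and $n = 2$, we have by (14.57) and (14.60), $\text{SS}(\alpha, \beta, \gamma|\mu) = \mathbf{y}'(\tfrac{1}{2}\mathbf{A} - \tfrac{1}{12}\mathbf{D})\mathbf{y}$, where

\[ \mathbf{A} = \begin{pmatrix} \mathbf{J} & \mathbf{O} & \mathbf{O} & \mathbf{O} & \mathbf{O} & \mathbf{O} \\ \mathbf{O} & \mathbf{J} & \mathbf{O} & \mathbf{O} & \mathbf{O} & \mathbf{O} \\ \mathbf{O} & \mathbf{O} & \mathbf{J} & \mathbf{O} & \mathbf{O} & \mathbf{O} \\ \mathbf{O} & \mathbf{O} & \mathbf{O} & \mathbf{J} & \mathbf{O} & \mathbf{O} \\ \mathbf{O} & \mathbf{O} & \mathbf{O} & \mathbf{O} & \mathbf{J} & \mathbf{O} \\ \mathbf{O} & \mathbf{O} & \mathbf{O} & \mathbf{O} & \mathbf{O} & \mathbf{J} \end{pmatrix}, \qquad \mathbf{D} = \mathbf{J}_{12} = \begin{pmatrix} \mathbf{J} & \mathbf{J} & \cdots & \mathbf{J} \\ \mathbf{J} & \mathbf{J} & \cdots & \mathbf{J} \\ \vdots & \vdots & & \vdots \\ \mathbf{J} & \mathbf{J} & \cdots & \mathbf{J} \end{pmatrix}, \]

and $\mathbf{J}$ and $\mathbf{O}$ are $2 \times 2$. Show that $\tfrac{1}{2}\mathbf{A} - \tfrac{1}{12}\mathbf{D}$ is idempotent, so that condition (c) of Theorem 5.6c is satisfied. To show that condition (d) holds, note that the degrees of freedom of $\sum_{ij} y_{ij.}^2/n - y_{...}^2/abn$ are $ab - 1$, which is easily shown to equal $(a - 1) + (b - 1) + (a - 1)(b - 1)$.

14.25 For $b = 2$, the sum of squares has only 1 degree of freedom and $\mathbf{C}$ has only one row. From (14.11), we obtain $H_0: \beta_1^* - \beta_2^* = 0$ or $H_0: \beta_1 - \beta_2 + \tfrac{1}{3}\gamma_{11} + \tfrac{1}{3}\gamma_{21} + \tfrac{1}{3}\gamma_{31} - \tfrac{1}{3}\gamma_{12} - \tfrac{1}{3}\gamma_{22} - \tfrac{1}{3}\gamma_{32} = 0$. Thus $\mathbf{C} = \mathbf{c}' = (0, 0, 0, 0, 1, -1, \tfrac{1}{3}, -\tfrac{1}{3}, \tfrac{1}{3}, -\tfrac{1}{3}, \tfrac{1}{3}, -\tfrac{1}{3})$, $\mathbf{c}'(\mathbf{X}'\mathbf{X})^{-}\mathbf{c} = 1/3$, $[\mathbf{c}'(\mathbf{X}'\mathbf{X})^{-}\mathbf{c}]^{-1} = 3$, and

\[ \mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{c}[\mathbf{c}'(\mathbf{X}'\mathbf{X})^{-}\mathbf{c}]^{-1}\mathbf{c}'(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}' = \frac{1}{12}\begin{pmatrix} \mathbf{J} & -\mathbf{J} & \mathbf{J} & -\mathbf{J} & \mathbf{J} & -\mathbf{J} \\ -\mathbf{J} & \mathbf{J} & -\mathbf{J} & \mathbf{J} & -\mathbf{J} & \mathbf{J} \\ \mathbf{J} & -\mathbf{J} & \mathbf{J} & -\mathbf{J} & \mathbf{J} & -\mathbf{J} \\ -\mathbf{J} & \mathbf{J} & -\mathbf{J} & \mathbf{J} & -\mathbf{J} & \mathbf{J} \\ \mathbf{J} & -\mathbf{J} & \mathbf{J} & -\mathbf{J} & \mathbf{J} & -\mathbf{J} \\ -\mathbf{J} & \mathbf{J} & -\mathbf{J} & \mathbf{J} & -\mathbf{J} & \mathbf{J} \end{pmatrix}, \]

where $\mathbf{J}$ is $2 \times 2$. This can be expressed as

\[ \frac{1}{12}\left[2\begin{pmatrix} \mathbf{J} & \mathbf{O} & \mathbf{J} & \mathbf{O} & \mathbf{J} & \mathbf{O} \\ \mathbf{O} & \mathbf{J} & \mathbf{O} & \mathbf{J} & \mathbf{O} & \mathbf{J} \\ \mathbf{J} & \mathbf{O} & \mathbf{J} & \mathbf{O} & \mathbf{J} & \mathbf{O} \\ \mathbf{O} & \mathbf{J} & \mathbf{O} & \mathbf{J} & \mathbf{O} & \mathbf{J} \\ \mathbf{J} & \mathbf{O} & \mathbf{J} & \mathbf{O} & \mathbf{J} & \mathbf{O} \\ \mathbf{O} & \mathbf{J} & \mathbf{O} & \mathbf{J} & \mathbf{O} & \mathbf{J} \end{pmatrix} - \mathbf{J}_{12}\right] = \tfrac{1}{6}\mathbf{B} - \tfrac{1}{12}\mathbf{J}_{12}. \]

Since $\tfrac{1}{6}\mathbf{B}$ is the same as $\tfrac{1}{6}\mathbf{C}$ in (14.59), the result is obtained.

14.26
\[ E(\varepsilon_{ijk}^2) = E(\varepsilon_{ijk} - 0)^2 = E[\varepsilon_{ijk} - E(\varepsilon_{ijk})]^2 = \operatorname{var}(\varepsilon_{ijk}) = \sigma^2, \]

\[ E(\varepsilon_{ijk}\varepsilon_{lmn}) = E[(\varepsilon_{ijk} - 0)(\varepsilon_{lmn} - 0)] = E\{[\varepsilon_{ijk} - E(\varepsilon_{ijk})][\varepsilon_{lmn} - E(\varepsilon_{lmn})]\} = \operatorname{cov}(\varepsilon_{ijk}, \varepsilon_{lmn}) = 0. \]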



14.27 


2 


bny* + bna; TOE: sided 


2 
= a bn? *2 + bn? ai? 4 (= oa] +2b7n? w* a 


i jk 


+2bnp* S- Eijk + 2bna; SS oa 


ik Jk 


= of a 2 bn? DS ai? ae Se Bie oT aS ( > earn 
i ijk i jkAlm 
i i Jk 


ik 
= = abn’ 24 By? x a’? + abno’. 
i 


14.30 Using a = 3a, y, = 2y,, and y, = 6y., (14.90) becomes B' HB = 45°, 
a? +850, a;7;+ 49>, ¥,, — 12a — 24a, y. — 127°. Show that the 10 
terms of 4)°;(a@, -@.+ 7; — 7)’ in (14.91) collapse to the same 
expression for B’HB involving 6 terms. 


=f (ut a + B+ ¥yg) = 5 Qu + 20; + 2B; + 2yy). 
k 




14.32 (b) By (14.47), (14.93), (14.94) and Problem 14.31(b, c), we have 


E[SS(y|u, a, B)] = 207 +2 S (EQ) — EU, — EG) + EGP? 
ij 


= 20° +25 [ut a+ B+ %y-M—-a-B-% 
i 


—p-a—B-¥thtat+Bryy 
=20° +2 (w—-% - +7). 
yj 


14.33
Analysis of Variance for the Lactic Acid Data in Table 13.5

Source    Sum of Squares   df   Mean Square   F
A            533.5445       1    533.5445     30.028
B           2974.0180       4    743.5045     41.844
AB           441.1580       4    110.2895      6.207
Error        177.6850      10     17.7685
Total       4126.4055      19

The p values for these three F's are .0003, .000003, and .009.
14.34
Analysis of Variance for the Hemoglobin Data in Table 13.7

Source         Sum of Squares   df   Mean Square   F
Rate             90.560375       3    30.186792    19.469
Method            2.415125       1     2.415125     1.558
Interaction       4.872375       3     1.624125     1.047
Error           111.637000      72     1.550514
Total           209.484875      79

The p value for the first F is 2.404 x 10^{-9}. The other two p values are .2161 and .3769.




Chapter 15 

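15.1
\[ \mathbf{W}'\mathbf{W} = \begin{pmatrix} n_1 & 0 & \cdots & 0 \\ 0 & n_2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & n_k \end{pmatrix}, \qquad \mathbf{W}'\mathbf{y} = \begin{pmatrix} y_{1.} \\ y_{2.} \\ \vdots \\ y_{k.} \end{pmatrix}, \]

\[ (\mathbf{W}'\mathbf{W})^{-1}\mathbf{W}'\mathbf{y} = \begin{pmatrix} y_{1.}/n_1 \\ y_{2.}/n_2 \\ \vdots \\ y_{k.}/n_k \end{pmatrix} = \begin{pmatrix} \bar{y}_{1.} \\ \bar{y}_{2.} \\ \vdots \\ \bar{y}_{k.} \end{pmatrix}. \]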

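15.2 (a) The reduced model $y_{ij} = \mu + \varepsilon_{ij}^*$ can be written in matrix form as $\mathbf{y} = \mu\mathbf{j} + \boldsymbol{\varepsilon}^*$, from which we have $\hat{\mu} = \bar{y}_{..}$ and $\hat{\mu}\mathbf{j}'\mathbf{y} = \bar{y}_{..}y_{..} = y_{..}^2/N = N\bar{y}_{..}^2$.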

JI. 

(b) SSB = gi'W'y — NY” = is Fos Id] 2 | - NI? 
Yk 

© Chas Hon - NE = Ee N(R) . 


15.3 (a) Sony, — 9. = d_ ny; — 2niy,d, + 297) 
i=1 i 


= So ng? — 29, Song, +P Soni 

I i i 
. Sn (%) 29. 520 +n(2) 
— i ; Nj N Fi nj N 


2 2 2 
ee 
Pa NN’ 
k Nj 
PL Lor = LY G20. +90 
pa Lo Yi 7 Vi = ay La Vij ViVi. T Yj. 
t=1 j= t J 


j 


=D (FEW) + EEE) 


2; 
“Sabo 
ij i i 
2 
Ae 
ij i 




a“ -1l4- a“ = 1 = 
15.4 (c'py[c((W'W) cic a/s? = (Sy civ.) (ie? /mi) Ley /s*. 
15.5 74% = SHO 4, CMO CHE = 14+ 1=0. 


inj 


-1 


ml O.... O Yu. Yu. 
goa. Guad O ny ... O Y12. Yy2. 
15.6 (WW) W'y = : : F . = ; 
0 0 ee NY y23, ¥23. 
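15.7 $\mathbf{W}\hat{\boldsymbol{\mu}} = (\bar{y}_{11.}, \bar{y}_{11.}, \bar{y}_{12.}, \bar{y}_{13.}, \bar{y}_{13.}, \ldots, \bar{y}_{23.})'$. Note that $\hat{\boldsymbol{\mu}}$ is $6 \times 1$ and $\mathbf{W}\hat{\boldsymbol{\mu}}$ is $11 \times 1$.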
15.8 This follows by definition; see, for example, (7.41). 


15.9 Since Bu = 0, that is bi =0, we can equate bi and bj to obtain 


2p, — Maa — Bag + 2éo1 — Mar — Bo3 = M2 — Mig + Ma2 — Mo3, Which 
reduces to 2p4, + 2My; = 2M + 2Mo,. We can obtain py + Mo = 
[13 + Mo; from bp = 0. 


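15.10 $\mathbf{a}'\hat{\boldsymbol{\mu}} = \sum_{ij} a_{ij}\bar{y}_{ij.}$ and $\mathbf{a}'(\mathbf{W}'\mathbf{W})^{-1}\mathbf{a} = \sum_{ij} a_{ij}^2/n_{ij}$.
15.11 $\mathbf{a}'(\mathbf{W}'\mathbf{W})^{-1}\mathbf{a} = 3.833$.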

15.12 B(W'W) 'B' = + ( ao: ah ). 

15.13 Show that KG’ = O. 

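15.14 By (3.42), $\operatorname{cov}(\hat{\boldsymbol{\mu}}_c) = \mathbf{K}'(\mathbf{K}\mathbf{K}')^{-1}\operatorname{cov}(\hat{\boldsymbol{\delta}}_c)(\mathbf{K}\mathbf{K}')^{-1}\mathbf{K}$.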


15.15 (a)
Analysis of Variance for the Weight Data of Table 14.6

Source     df   Sum of Squares   Mean Square   F      p Value
Protein     4     111,762.28      27940.57     8.63   .0000169
Error      56     181,256.71       3236.73
Total      60     293,018.98

(b)
F Tests for Unweighted Contrasts

Contrasts             df   Contrast SS   F       p Value
L, C vs. So, Su, M     1     2,473.61     0.76   .386
So, M vs. Su           1    36,261.93    11.20   .00147
So vs. M               1     5,563.22     1.72   .195
L vs. C                1    65,940.17    20.37   .0000332




(c)
F Tests for Two Weighted Contrasts

Contrasts             df   Contrast SS   F      p Value
L, C vs. So, Su, M     1     2,473.61    0.76   .386
So, M vs. Su           1    31,673.79    9.79   .00278


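15.17 (a) $\boldsymbol{\mu} = (\mu_{11}, \mu_{12}, \mu_{13}, \mu_{21}, \mu_{22}, \mu_{23})'$, $\mathbf{W}$ is $47 \times 6$, $\mathbf{a}$ is $6 \times 1$, $\mathbf{B}$ and $\mathbf{C}$ are each $2 \times 6$.

\[ \mathbf{a}' = (1, 1, 1, -1, -1, -1), \]
\[ \mathbf{B} = \begin{pmatrix} 1 & -2 & 1 & 1 & -2 & 1 \\ 1 & 0 & -1 & 1 & 0 & -1 \end{pmatrix}, \qquad \mathbf{C} = \begin{pmatrix} 1 & -2 & 1 & -1 & 2 & -1 \\ 1 & 0 & -1 & -1 & 0 & 1 \end{pmatrix}, \]
\[ \hat{\boldsymbol{\mu}} = \bar{\mathbf{y}} = (96.50, 85.90, 95.17, 79.20, 91.83, 82.00)', \]

\[ \text{SSE} = 8436.1667, \quad \nu_E = 41, \]
\[ F_A = 3.65104, \quad F_B = .022053, \quad F_C = 2.90567. \]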

(b) G is the same as C in part (a) 


SES Se eS 
— 
— 
| 
— 
| 
— 
| 
— 


\[ \hat{\boldsymbol{\mu}}_c = (91.61, 91.31, 92.66, 83.11, 82.81, 84.15)', \]
\[ \text{SSE}_c = 9631.9072, \quad \nu_E = 43, \]
\[ F_{Ac} = 3.687, \quad F_{Bc} = .03083. \]


(c)

Analysis of Variance for Unconstrained Model

Source          df   Sum of Squares   Mean Square   F      p Value
Level            1      751.238         751.238     3.65   .0630
Type             2        9.075           4.538      .02   .978
Level x type     2     1195.7406        597.870     2.91   .0661
Error           41     8436.167         205.760
Total           46    10474.851




Analysis of Variance for Constrained Model

Source    df   Sum of Squares   Mean Square   F      p Value
Level      1      826.544         826.544     3.69   .0614
Type       2       13.810           6.905     0.03   .970
Error     43    9,631.907         223.998
Total     46   10,474.851


15.18 Analysis of Variance for Data in Table 15.12

Source                 df   Sum of Squares   Mean Square   F        p Value
Fertilizer              3    20,979.042       6993.014     111.52   5.773 x 10^{-15}
Variety                 4       306.621         76.655       1.22   0.325
Fertilizer x variety   12       997.589         83.132       1.33   0.263
Error                  26     1,630.333         62.705
Total                  45    28,486.370


Chapter 16 


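16.1 Verify that $\mathbf{I} - \mathbf{P}$ is symmetric and idempotent. Then $\mathbf{X}'(\mathbf{I} - \mathbf{P})\mathbf{X} = \mathbf{X}'(\mathbf{I} - \mathbf{P})'(\mathbf{I} - \mathbf{P})\mathbf{X}$, and by Theorem 2.4(iii), $\operatorname{rank}[\mathbf{X}'(\mathbf{I} - \mathbf{P})\mathbf{X}] = \operatorname{rank}[(\mathbf{I} - \mathbf{P})\mathbf{X}]$. The matrix $(\mathbf{I} - \mathbf{P})\mathbf{X}$ is $n \times q$. To show that $\operatorname{rank}[(\mathbf{I} - \mathbf{P})\mathbf{X}] = q$, we demonstrate that the columns are linearly independent. By the definition of linear independence in (2.40), the columns of $(\mathbf{I} - \mathbf{P})\mathbf{X}$ are linearly independent if $(\mathbf{X} - \mathbf{P}\mathbf{X})\mathbf{a} = \mathbf{0}$ implies $\mathbf{a} = \mathbf{0}$. Suppose that there is a vector $\mathbf{a} \neq \mathbf{0}$ such that $(\mathbf{X} - \mathbf{P}\mathbf{X})\mathbf{a} = \mathbf{0}$. Then

\[ \mathbf{X}\mathbf{a} = \mathbf{P}\mathbf{X}\mathbf{a} = \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-}\mathbf{Z}'\mathbf{X}\mathbf{a}. \]

By Theorem 2.8c(iii), a solution to this is $\mathbf{X}\mathbf{a} = \mathbf{z}_i$, where $\mathbf{z}_i$ is the $i$th column of $\mathbf{Z}$. But this is impossible since the columns of $\mathbf{Z}$ are linearly independent of those of $\mathbf{X}$. We therefore have $\mathbf{X}\mathbf{a} = \mathbf{0}$, which implies that $\mathbf{a} = \mathbf{0}$, since $\mathbf{X}$ is full-rank. This contradicts the possibility that $\mathbf{a} \neq \mathbf{0}$.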


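16.2 (a) By (16.11) and (16.14) to (16.17), we obtain

\[ \begin{aligned} \text{SSE}_{y\cdot x} &= \mathbf{y}'\mathbf{y} - \mathbf{y}'\mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-}\mathbf{Z}'\mathbf{y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'(\mathbf{I} - \mathbf{P})\mathbf{y} \\ &= \mathbf{y}'(\mathbf{I} - \mathbf{P})\mathbf{y} - \mathbf{e}_{xy}'\mathbf{E}_{xx}^{-1}\mathbf{e}_{xy} = e_{yy} - \mathbf{e}_{xy}'\mathbf{E}_{xx}^{-1}\mathbf{e}_{xy}. \end{aligned} \]

(b) By (12.21), $\text{SSE}_y = \mathbf{y}'[\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-}\mathbf{Z}']\mathbf{y}$, which, by (16.13), equals $\mathbf{y}'(\mathbf{I} - \mathbf{P})\mathbf{y}$.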




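16.3 By Theorem 8.4a(ii), $\text{SSH} = (\mathbf{C}\hat{\boldsymbol{\beta}})'\mathbf{A}(\mathbf{C}\hat{\boldsymbol{\beta}})$, where $\mathbf{A} = [\operatorname{cov}(\mathbf{C}\hat{\boldsymbol{\beta}})/\sigma^2]^{-1}$. In this case, $\mathbf{C} = \mathbf{I}$ and by (15.19) $\operatorname{cov}(\hat{\boldsymbol{\beta}}) = \sigma^2[\mathbf{X}'(\mathbf{I} - \mathbf{P})\mathbf{X}]^{-1}$.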

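16.4 By (13.12), a solution $\hat{\boldsymbol{\alpha}}$ is given by $\hat{\boldsymbol{\alpha}} = (0, \bar{y}_{1.}, \ldots, \bar{y}_{k.})'$. By analogy to (13.7), $\mathbf{Z}'\mathbf{x} = (x_{..}, x_{1.}, \ldots, x_{k.})'$, and by (13.11), a generalized inverse is $(\mathbf{Z}'\mathbf{Z})^{-} = \operatorname{diag}(0, 1/n, \ldots, 1/n)$. Then $(\mathbf{Z}'\mathbf{Z})^{-}\mathbf{Z}'\mathbf{x} = (0, \bar{x}_{1.}, \ldots, \bar{x}_{k.})'$.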

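16.5 By (16.13) and (16.16), $e_{xx} = \mathbf{x}'(\mathbf{I} - \mathbf{P})\mathbf{x} = \mathbf{x}'\mathbf{x} - \mathbf{x}'\mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-}\mathbf{Z}'\mathbf{x}$. From the answer to Problem 16.4, $\mathbf{x}'\mathbf{Z} = (x_{..}, x_{1.}, \ldots, x_{k.})$ and $(\mathbf{Z}'\mathbf{Z})^{-}\mathbf{Z}'\mathbf{x} = (0, \bar{x}_{1.}, \ldots, \bar{x}_{k.})'$. Thus $\mathbf{x}'\mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-}\mathbf{Z}'\mathbf{x} = \sum_i x_{i.}\bar{x}_{i.} = n\sum_i \bar{x}_{i.}^2$ and $e_{xx} = \sum_{ij} x_{ij}^2 - n\sum_i \bar{x}_{i.}^2$. Show that $e_{xx}$ can be written as $e_{xx} = \sum_{ij}(x_{ij} - \bar{x}_{i.})^2$. The quantities $e_{xy}$ and $e_{yy}$ can be found in an analogous manner.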


16.6 (a) By (16.39) and (16.40), we have 


I--J O Oo 
x, 0 0’ n 
0 x, 0’ o 1-2 O 
x/(I— P) = 
0 60’ Xi) I 
O O I--J 
n 
1 
x) ——J) 0! 0 
n 
1 
0! x,(I—=J) 0! 
ay n 
1 
0’ 0 .. ¥&0--J 
n 
Hae ws 0 0) 
1 iT ee 
n 
1 
0 x{I--Jx. ... 0 
X'(I— P)X = n 


1 
0 0 ee x, (I — — J)x: 




To show that this equal to (16.41), we have, for example 


1 Xojj/X2 5 
x, (I-—-J )x. = xjx0 — = xj - = 
n n ; 


Show that this equals }> eyes Hy. 
(b) Using X’(I — P) from part (a), we have 


1 
x) (1-73) 0/ = 0/ 
n 


1 
0’ (1-13) eof 0’ 
x’ — P)y= a 


0’ 0’ fad x (1-13 


x, (1 — “3) 
1 A Yi 


1 
x —— Jy, 
n 


The elements of this vector are, for example 


1 XI. X2.y 
x) (1 ~3) Yor XY? 2 i 2 = s X2j92j — . 7 
j ; 


Show that this equals yi (xj — X2.)(y2j — ¥2,). 


Yi 
Y2 


Yk 




16.7 


SG — Sy 0 — Fy) = Do rnin — So mindy. — S_ Fa yu 


ik ik ijk ijk 


+n x Xi Vij, 
ij 
k 


ik ij 
= 2 (s ms) +n So Xa Hy 
ij k ij 
= > XijkVijk — 1 2S Xi Dij. 


ijk i 


—n ) Xp Vij, +n ) Xi Vij. 
ij ij 
16.8 In (14.40) and (14.41), we have 


2 2: 
y <a ye 4 yen 
—n abn 7 bn abn 77 an abn 


y 


By an analogous identity, we have (note that b is replaced by c) 


XG Di XY XMiVi. XY... AG NG: 2 He 
n acn (=: cn acn ar (>: an acn 
Xij. Vij. Xi. Vi... XiYj | XY... 
+(e n 2 cn yy an ee), 


i 


Show that the right side is equal to 


cn Y.-F. Gi, — Van YB. — BIG; 3.) 
i Jj 


+n>— By —%, —%j, + ¥. My. I. — Ip, +I 
- 


= SPA + SPC + SPAC. 




sii (SPC + SPR 
SSC, + SSE, 
SSC). = SS(C + Ey. — SSE). 
_ SSACx/(e— 1) 
SSE).x/[ac(n — 1) — 1] 


SS(C + E);, = SSCs + SSE, — 


The F statistic is distributed as F[c— 1, ac(n—1)—1] if Ho is true. 


(b) : 
SPAC + SPE 
SS(AC + E),. = SSAC, + SSE, mae + = 


SSAC), = SS(AC + B)y.x — SSEy.x 


_ SSAC. /(a— De - 1) 
SSE,,,/[ac(n — 1) — 1] © 


The F statistic is distributed as F[(a— 1) (c—1), ac (n—1)—1] if Ho is 
true. 


16.10 (a) Ex = X'U—P)x 


I--J (@) O 
xX 
1 
Eanice xX 
Be x! 25 Oo I re Oo 2 
—_ > D9: 20519 289 k : 
ale: 
O Oo I--J 
n 
X; 
fx; (1-15),20(1—13),....4(1—4u)] * 
— 1 Pen a Tater 
n n n : 
Xx 


A I 
= yox(1-45)x, 
i=l ut 


16.11 By Theorem 2.2c(i), the diagonal elements of X(,;X,; are products of 
columns of X,;. Thus, for example, the seond diagonal element of X/;X.j 
in (16.70) is 


Xii2 — Xi2 4 
= x ; mee 
(Xi12 — Xi.25 «+ +5 Xin2 — Xi.2) : =) Gip — 2). 


= = 
Xin2 — Xi.2 J 




Similarly, the (1,2) element of X,/X,; is 


Xi12 — Xi.2 


(Kid — Xity +++ Xint — Xi) : = S° Gin — Xi )(Xy2 — Xi2). 
Xin2 — Xi.2 ce 
16.12 OD Biss CON pt TE Sassen, Ff = 
1 */ 1 1 1 
Ot ae Oo ae 2 wee 
(Z'Z)-2Xp = n 0’ j’ 0’ 2 p 
Re 3: : xX, 
@) 0 ie / / e/ 
Fi 0 0 ... fj 
0 0 0 
>< 
j 0’ 0’ Be 
/ o/ / 2 A 
0’ 0’ j’ Xk 
jX x xB 
jX. |. x | xB 
Sel) 2 (Pa lee 
/ =! = pei 
j Xx Xi Xi, 
16.13 = rr 
> Ow - 5. — 35 04-5." 
ij ij 


= S yj — 29, So yy + kny? 
ij ij 


- [S82 (n oy) Es 
y 1 J i 
= So yj — bay? — So yj + 2n Soy; — 2 S09; 
y y i i 
=n) 5, — kay. =n G.-Y. 


16.14
(a) $e_{xx} = 358.1667$, $e_{xy} = 488.5000$, $e_{yy} = 5937.8333$, $\hat{\beta} = 1.3639$.
(b) $\text{SSE}_{y\cdot x} = 5271.5730$ with 19 df, $\text{SST}_{y\cdot x} = 6651.1917$ with 22 df, $\text{SS}(\alpha|\mu, \beta) = 1379.6188$ with 3 df, $F = 1.6575$, $p = .210$.
(c) $F = 2.4014$, $p = .138$.
(d) $\hat{\beta}_1 = 1.9950$, $\hat{\beta}_2 = -.9878$, $\hat{\beta}_3 = -1.2687$, $\hat{\beta}_4 = 3.1646$,

\[ F = \frac{(5271.5730 - 4178.2698)/3}{4178.2698/16} = 1.3955, \quad p = .280. \]

16.15 (a)
Sums of Squares and Products for x and y
SS and SP Corrected for the Mean

Source    y              x          xy
A           268,043.37    811.11    -14,627.22
C           588,510.81   2485.11    -14,268.89
AC        1,789,999.1    2411.11     -2,736.222
Error     7,717,172.7    5632.67   -168,409.7
A+E       7,985,216      6443.78   -183,036.9
C+E       8,305,683.5    8117.78   -182,678.6
AC+E      9,507,171.7    8043.78   -171,145.9

(b) $\text{SS(A + E)}_{y\cdot x} = 2{,}786{,}014$, $\text{SSE}_{y\cdot x} = 2{,}681{,}934.9$, $\text{SSA}_{y\cdot x} = 104{,}079.06$.

For factor A, $F = \dfrac{104{,}079.06/2}{2{,}681{,}934.92/35} = .6791$, $p = .514$.

For factor C, $F = \dfrac{1{,}512{,}838.46/5}{2{,}681{,}934.92/35} = 3.9486$, $p = .0061$.

For interaction AC, $F = \dfrac{3{,}183{,}799.17/10}{2{,}681{,}934.92/35} = 4.1549$, $p = .000783$.

(c)
\[ F = \frac{(\text{SPE})^2/\text{SSE}_x}{\text{SSE}_{y\cdot x}/[ac(n - 1) - 1]} = \frac{(-168{,}409.7)^2/5632.67}{2{,}681{,}934.92/35} = 65.7113, \quad p = 1.516 \times 10^{-9}. \]

(d) For factor A

\[ \hat{\beta}_1 = \frac{\text{SPE}_1}{\text{SSE}_{x,1}} = \frac{-50{,}869.67}{1{,}274.67} = -39.9082, \quad \hat{\beta}_2 = \frac{\text{SPE}_2}{\text{SSE}_{x,2}} = \frac{-37{,}796.33}{1{,}987.33} = -19.0187, \quad \hat{\beta}_3 = \frac{\text{SPE}_3}{\text{SSE}_{x,3}} = \frac{-79{,}743.67}{2{,}370.67} = -33.6377, \]

\[ \text{SS(F)} = \text{SSE}_y - \sum_{i=1}^{3}\frac{(\text{SPE}_i)^2}{\text{SSE}_{x,i}} = 2{,}285{,}831.3, \qquad \text{SS(R)} = \text{SSE}_y - \frac{(\text{SPE})^2}{\text{SSE}_x} = 2{,}681{,}934.9. \]

By (16.64), we obtain

\[ F = \frac{(2{,}681{,}934.9 - 2{,}285{,}831.3)/2}{2{,}285{,}831.3/33} = 2.8592, \quad p = .0716. \]

For factor C

$\hat{\beta}_1 = -32.8195$, $\hat{\beta}_2 = -30.3492$, $\hat{\beta}_3 = -26.2928$, $\hat{\beta}_4 = -27.8251$, $\hat{\beta}_5 = -53.1667$, $\hat{\beta}_6 = -28.2191$,

\[ F = \frac{156{,}728.91/5}{2{,}525{,}206.01/30} = .3724, \quad p = .864. \]

16.16
(a)
Sums of Squares and Products for x and y
(SS and SP Corrected for the Mean)

Source    y           x         xy
A          235.225     176.4     203.70
C           30.625       0.400    -3.50
AC           3.025      12.10     -6.05
Error      867.500    5170.2    1253.10
A+E       1102.725    5346.6    1456.80
C+E        898.125    5170.6    1249.60
AC+E       870.525    5182.3    1247.05
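(b) $\mathrm{SS(A+E)}_{y\cdot x} = 705.7875$, $\mathrm{SSE}_{y\cdot x} = 563.7865$, $\mathrm{SSA}_{y\cdot x} = 142.0010$.

For factor A, $F = \dfrac{142.0010/1}{563.7865/35} = 8.8155$, $p = .0054$.

For factor C, $F = \dfrac{32.3426/1}{563.7865/35} = 2.0078$, $p = .165$.

For interaction AC, $F = \dfrac{6.6529/1}{563.7865/35} = .4130$, $p = .525$.

(c) $$F = \frac{(\mathrm{SPE})^2/\mathrm{SSE}_x}{\mathrm{SSE}_{y\cdot x}/[ac(n-1)-1]} = \frac{(1253.10)^2/5170.200}{563.7865/35} = 18.8546, \quad p = .000115.$$

(d) For factor A,
$$\hat{\beta}_1 = \frac{\mathrm{SPE}_1}{\mathrm{SSE}_{x,1}} = \frac{996.10}{3109.7} = .3203, \qquad
\hat{\beta}_2 = \frac{\mathrm{SPE}_2}{\mathrm{SSE}_{x,2}} = \frac{257.00}{2060.50} = .1247,$$
$$\mathrm{SS(F)} = \mathrm{SSE}_y - \sum_{i=1}^{2}\frac{(\mathrm{SPE}_i)^2}{\mathrm{SSE}_{x,i}} = 516.3741,$$
$$\mathrm{SS(R)} = \mathrm{SSE}_y - \frac{(\mathrm{SPE})^2}{\mathrm{SSE}_x} = 563.7865.$$

By (16.64),
$$F = \frac{(563.7865 - 516.3741)/1}{516.3741/34} = 3.1218, \quad p = .0862.$$

For factor C, $\hat{\beta}_1 = .2034$, $\hat{\beta}_2 = .2870$,
$$F = \frac{8.9930/1}{554.7935/34} = .5511, \quad p = .463.$$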


16.17 (a) — Ga lian : ( ) 
= 2877.4 4876.9)’ ~\ 26.219 
1450, B ey 
y=. 5 = : 
= 007414 
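(b) $\mathrm{SSE}_{y\cdot x} = .67026$, $\mathrm{SST}_{y\cdot x} = .84150$,
$\mathrm{SS}(\alpha|\mu, \beta) = \mathrm{SST}_{y\cdot x} - \mathrm{SSE}_{y\cdot x} = .17124$,
$$F = \frac{.17124/3}{.67026/34} = 2.8955, \quad p = .0493.$$

(c) To test $H_0\colon \boldsymbol{\beta} = \mathbf{0}$, we use (16.84):
$$F = \frac{\mathbf{e}_{xy}'\mathbf{E}_{xx}^{-1}\mathbf{e}_{xy}/q}{\mathrm{SSE}_{y\cdot x}/[k(n-1)-q]} = 4.4378, \quad p = .0194.$$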




Asp ee bc) 7 ees 

' \ 983.4 1076.4 5.694)  \ .01076/’ 
ie Ge i) 7 ead 
2 836.0 1512.0 3.95) \  .005209 ]’ 
es a aes es ee 

3" (513.0 1552.0 12.94) \ .01086 /’ 
ot Ge ee) 7 Co 

4 \. 545.0 736.5 3.635/  \ .004224 


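$$\mathrm{SSE(F)}_{y\cdot x} = e_{yy} - \sum_{i=1}^{4}\mathbf{e}_{xy,i}'\mathbf{E}_{xx,i}^{-1}\mathbf{e}_{xy,i} = .62284,$$
$$\mathrm{SSE(R)}_{y\cdot x} = e_{yy} - \mathbf{e}_{xy}'\mathbf{E}_{xx}^{-1}\mathbf{e}_{xy} = .67026.$$

By (16.89),
$$F = \frac{[\mathrm{SSE(R)}_{y\cdot x} - \mathrm{SSE(F)}_{y\cdot x}]/q(k-1)}{\mathrm{SSE(F)}_{y\cdot x}/k(n-q-1)} = \frac{.047425/6}{.62284/28} = .3553, \quad p = .901.$$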


Chapter 17 


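17.1 If $\mathbf{V} = \mathbf{P}\mathbf{P}'$, let $\mathbf{v} = \mathbf{P}^{-1}\mathbf{y}$ and $\mathbf{W} = \mathbf{P}^{-1}\mathbf{X}$. Then $\mathbf{v}$ is $N_n(\mathbf{W}\boldsymbol{\beta}, \sigma^2\mathbf{I})$, $\hat{\boldsymbol{\beta}} = (\mathbf{W}'\mathbf{W})^{-1}\mathbf{W}'\mathbf{v}$, and
$$F = \frac{(\mathbf{C}\hat{\boldsymbol{\beta}} - \mathbf{t})'[\mathbf{C}(\mathbf{W}'\mathbf{W})^{-1}\mathbf{C}']^{-1}(\mathbf{C}\hat{\boldsymbol{\beta}} - \mathbf{t})/q}{\mathbf{v}'[\mathbf{I} - \mathbf{W}(\mathbf{W}'\mathbf{W})^{-1}\mathbf{W}']\mathbf{v}/(n-k-1)}.$$

(a) By Theorem 8.4g(ii), $F$ is $F(q, n-k-1)$.
(b) By Theorem 8.4g(i), $F$ is $F(q, n-k-1, \lambda)$, where $\lambda = (\mathbf{C}\boldsymbol{\beta} - \mathbf{t})'[\mathbf{C}(\mathbf{W}'\mathbf{W})^{-1}\mathbf{C}']^{-1}(\mathbf{C}\boldsymbol{\beta} - \mathbf{t})/2\sigma^2$.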


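17.2 As in Problem 17.1 and using (8.49), the confidence interval is
$$\mathbf{a}'\hat{\boldsymbol{\beta}} \pm t_{\alpha/2,\,n-k-1}\, s\sqrt{\mathbf{a}'(\mathbf{W}'\mathbf{W})^{-1}\mathbf{a}}$$
or
$$\mathbf{a}'\hat{\boldsymbol{\beta}} \pm t_{\alpha/2,\,n-k-1}\, s\sqrt{\mathbf{a}'(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{a}},$$
where $s$ is given by (7.67).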




js j;000j,0j,00000 
js i; 0000j,0j,0000 
j, 0 j;00j,000j,000 
j, 0 j,000j,000j,00 
j; 0 0j,0j,00000j, 0 
00j,00j,0000 0j, 
j; 00 0j;,j,00000 0 0j, 
000j,0j,000000 0j, 


17.3 Let Xp= 


— a — a — rr) 
oo coco c © 


Xo ja 9 0 
Xo 0 ju: 9 


mio 
Ss 
mie 
= 
— a) 
oe 
a) 
— a) 


0 0 + ji 00 =: j; 
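17.4 (a) $$\mathrm{cov}(\mathbf{y}) = \mathrm{cov}\Big(\sum_{i=1}^{m}\mathbf{Z}_i\mathbf{a}_i + \boldsymbol{\varepsilon}\Big) = \sum_{i=1}^{m}\mathbf{Z}_i\mathbf{G}_i\mathbf{Z}_i' + \mathbf{R}.$$
(b) $\mathrm{cov}(\mathbf{y}) = \mathbf{Z}\mathbf{G}\mathbf{Z}' + \mathbf{R}$.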


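17.5 Using (5.4),
$$\begin{aligned}
E[\mathbf{y}'\mathbf{K}'(\mathbf{K}\boldsymbol{\Sigma}\mathbf{K}')^{-1}\mathbf{K}\mathbf{Z}_i\mathbf{Z}_i'\mathbf{K}'(\mathbf{K}\boldsymbol{\Sigma}\mathbf{K}')^{-1}\mathbf{K}\mathbf{y}]
&= \mathrm{tr}[\mathbf{K}'(\mathbf{K}\boldsymbol{\Sigma}\mathbf{K}')^{-1}\mathbf{K}\mathbf{Z}_i\mathbf{Z}_i'\mathbf{K}'(\mathbf{K}\boldsymbol{\Sigma}\mathbf{K}')^{-1}\mathbf{K}\boldsymbol{\Sigma}] \\
&\quad + \boldsymbol{\beta}'\mathbf{X}'\mathbf{K}'(\mathbf{K}\boldsymbol{\Sigma}\mathbf{K}')^{-1}\mathbf{K}\mathbf{Z}_i\mathbf{Z}_i'\mathbf{K}'(\mathbf{K}\boldsymbol{\Sigma}\mathbf{K}')^{-1}\mathbf{K}\mathbf{X}\boldsymbol{\beta} \\
&= \mathrm{tr}[\mathbf{K}\boldsymbol{\Sigma}\mathbf{K}'(\mathbf{K}\boldsymbol{\Sigma}\mathbf{K}')^{-1}\mathbf{K}\mathbf{Z}_i\mathbf{Z}_i'\mathbf{K}'(\mathbf{K}\boldsymbol{\Sigma}\mathbf{K}')^{-1}] + 0 \quad (\text{since } \mathbf{K}\mathbf{X} = \mathbf{O}) \\
&= \mathrm{tr}[\mathbf{K}\mathbf{Z}_i\mathbf{Z}_i'\mathbf{K}'(\mathbf{K}\boldsymbol{\Sigma}\mathbf{K}')^{-1}] \\
&= \mathrm{tr}[\mathbf{K}'(\mathbf{K}\boldsymbol{\Sigma}\mathbf{K}')^{-1}\mathbf{K}\mathbf{Z}_i\mathbf{Z}_i'].
\end{aligned}$$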


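17.6 $\mathbf{q}$ is obvious;
$$\begin{aligned}
\mathrm{tr}[\mathbf{K}'(\mathbf{K}\boldsymbol{\Sigma}\mathbf{K}')^{-1}\mathbf{K}\mathbf{Z}_i\mathbf{Z}_i']
&= \mathrm{tr}[(\mathbf{K}\boldsymbol{\Sigma}\mathbf{K}')^{-1}\mathbf{K}\mathbf{Z}_i\mathbf{Z}_i'\mathbf{K}'] \\
&= \mathrm{tr}[(\mathbf{K}\boldsymbol{\Sigma}\mathbf{K}')^{-1}\mathbf{K}\mathbf{Z}_i\mathbf{Z}_i'\mathbf{K}'(\mathbf{K}\boldsymbol{\Sigma}\mathbf{K}')^{-1}\mathbf{K}\boldsymbol{\Sigma}\mathbf{K}'] \\
&= \mathrm{tr}\Big[\mathbf{K}'(\mathbf{K}\boldsymbol{\Sigma}\mathbf{K}')^{-1}\mathbf{K}\mathbf{Z}_i\mathbf{Z}_i'\mathbf{K}'(\mathbf{K}\boldsymbol{\Sigma}\mathbf{K}')^{-1}\mathbf{K}\sum_{j=0}^{m}\mathbf{Z}_j\mathbf{Z}_j'\sigma_j^2\Big] \\
&= \mathbf{m}_i'\boldsymbol{\sigma},
\end{aligned}$$
where the $j$th element of $\mathbf{m}_i$ is
$$\mathrm{tr}[\mathbf{K}'(\mathbf{K}\boldsymbol{\Sigma}\mathbf{K}')^{-1}\mathbf{K}\mathbf{Z}_i\mathbf{Z}_i'\mathbf{K}'(\mathbf{K}\boldsymbol{\Sigma}\mathbf{K}')^{-1}\mathbf{K}\mathbf{Z}_j\mathbf{Z}_j'].$$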


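17.7 (a) If $\boldsymbol{\Sigma} = \mathbf{P}\mathbf{P}'$, let $\mathbf{v} = \mathbf{P}^{-1}\mathbf{y}$ and $\mathbf{W} = \mathbf{P}^{-1}\mathbf{X}$. Then $\mathbf{v}$ is $N(\mathbf{W}\boldsymbol{\beta}, \mathbf{I})$ and
$\mathbf{L}(\mathbf{X}'\boldsymbol{\Sigma}^{-1}\mathbf{X})^{-}\mathbf{L}' = \mathbf{L}(\mathbf{W}'\mathbf{W})^{-}\mathbf{L}'$. Since $\mathbf{L}\boldsymbol{\beta}$ is estimable, $\mathbf{L} = \mathbf{A}\mathbf{X} = \mathbf{A}\mathbf{P}\mathbf{P}^{-1}\mathbf{X} = \mathbf{B}\mathbf{W}$. Hence the rows of $\mathbf{L}$ also define estimable functions in the transformed model for $\mathbf{v}$. Thus by Theorem 12.7b, $\mathbf{L}(\mathbf{X}'\boldsymbol{\Sigma}^{-1}\mathbf{X})^{-}\mathbf{L}'$ is nonsingular.

(b) Since $\mathbf{L}\hat{\boldsymbol{\beta}} - \mathbf{L}\boldsymbol{\beta}$ is $N[\mathbf{0}, \mathbf{L}(\mathbf{W}'\mathbf{W})^{-}\mathbf{L}']$, note that $[\mathbf{L}(\mathbf{W}'\mathbf{W})^{-}\mathbf{L}']^{-1}\mathbf{L}(\mathbf{W}'\mathbf{W})^{-}\mathbf{L}' = \mathbf{I}$, which is idempotent of rank $g$.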
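17.8 $\mathbf{x}_0'\boldsymbol{\beta}$ is estimable and $\mathbf{x}_0'\hat{\boldsymbol{\beta}}$ is $N[\mathbf{x}_0'\boldsymbol{\beta}, \mathbf{x}_0'(\mathbf{W}'\mathbf{W})^{-}\mathbf{x}_0]$. Hence
$$\frac{\mathbf{x}_0'\hat{\boldsymbol{\beta}} - \mathbf{x}_0'\boldsymbol{\beta}}{\sqrt{\mathbf{x}_0'(\mathbf{W}'\mathbf{W})^{-}\mathbf{x}_0}}$$
is $N(0, 1)$.

Thus a $100(1-\alpha)\%$ confidence interval for $\mathbf{x}_0'\boldsymbol{\beta}$ is
$$\mathbf{x}_0'\hat{\boldsymbol{\beta}} \pm z_{\alpha/2}\sqrt{\mathbf{x}_0'(\mathbf{W}'\mathbf{W})^{-}\mathbf{x}_0} \quad\text{or}\quad \mathbf{x}_0'\hat{\boldsymbol{\beta}} \pm z_{\alpha/2}\sqrt{\mathbf{x}_0'(\mathbf{X}'\boldsymbol{\Sigma}^{-1}\mathbf{X})^{-}\mathbf{x}_0}.$$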




17.9 


3' 0 
al 
~ OS. ea 56 
(XE XY = |d6 I) () 
: : : Is 
0 Oo > 
Re, —l 
>i O 
B21 OG. Se" 0 
0 oO & 
1 DF O oO 
=5 Oo 0 
Oo oO 


17.10 Let C= (hk, O) and K= CO — W)= 1(&, —I). Then K(KEK’)"! K 


T _T >, Oo oO 
as a—1 : 
= H(t :) where T = O: : 3; io . Since Zp = Ij. and 
O O Dv 
jo 0 0 
a | 
Z, = . : . |, the REML equations become 
OO) cen. 


3 (2!) = y [K’(KEK)]'KK'(K3K’)K]y, | and 
3 tr(S Jn) = y'[K'(K2K’)!KJK'(KEK’)K]y. 


2 
Noting that 37! = u ! ) , the REML equations 


—6 
& (6° +267) —6 + a 




can be written as 


OE +267) 26? +2627 


U _U Jo O O 
where u=y’ Ca a y and U= O J, O |}. The second 
O O Jz 
uo 
REML equation can be rearranged 6? = ao 5 Substituting this 
expression into the first REML equation and then simplifying, we 
obtain 1 ae = uae a ‘Py where P = 
or. Rg ~ 
R O O -R O O 
oO R O O -R O 
Oo oO R O O -R ae eer noe. Aes 
_R O O R O o |: which simplifies to 6° = DY Py. 
Oo -RK O O R O 
Oo O -R O O R 


X(X'X)!L’QL(X'X) |X’ 


LI \ if 2 S14 
= (sal 5) 3 16) 


>, Oo O 
1202? \W W : 
Oo oO > 
2 —2 -il 1 -l 1 
—2 2 1 -l 1 -l 
—l 1 2 —2 -l 1 
where 




Now, 
1/W W\1/(W WwW 
122\w w/i2\w w 
— 1 (2W 2Ww 
~ 144 2w? 2W? 
—1(Www 
12\W Ww 
17.12 1 (/W W)\C1 
a (w Wea? 
ROO -R O O 
1(/Www 
=— O RO O-R O 
24\W WwW 
O OR O O-R 
= 0. 
a2 a2 
17.13 e'(x’S xy Hey! _ 62 4.262 — @ ee 
Ww 


Using results from the solution to Problem 17.10, v= xa 
(+ yPy+ Hu G’) = + y'Dy where D is a nonzero square matrix not 


involving o~. Hence (<4:D) we ioD. 


(n—Wle(X/E "Xe (n—Hle(X’X)-ds?_ MW — Hs? 


17.14 
[e'(X'E Xe] [e/(X’X) e]o? oe 
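17.15
$$\begin{aligned}
\frac{\partial}{\partial\sigma_i^2}\big[\mathbf{c}'(\mathbf{X}'\boldsymbol{\Sigma}^{-1}\mathbf{X})^{-}\mathbf{c}\big]
&= -\mathbf{c}'(\mathbf{X}'\boldsymbol{\Sigma}^{-1}\mathbf{X})^{-}\Big(\frac{\partial}{\partial\sigma_i^2}\mathbf{X}'\boldsymbol{\Sigma}^{-1}\mathbf{X}\Big)(\mathbf{X}'\boldsymbol{\Sigma}^{-1}\mathbf{X})^{-}\mathbf{c}
\quad [\text{by an extension of (2.117)}] \\
&= -\mathbf{c}'(\mathbf{X}'\boldsymbol{\Sigma}^{-1}\mathbf{X})^{-}\mathbf{X}'\Big(\frac{\partial\boldsymbol{\Sigma}^{-1}}{\partial\sigma_i^2}\Big)\mathbf{X}(\mathbf{X}'\boldsymbol{\Sigma}^{-1}\mathbf{X})^{-}\mathbf{c} \\
&= \mathbf{c}'(\mathbf{X}'\boldsymbol{\Sigma}^{-1}\mathbf{X})^{-}\mathbf{X}'\boldsymbol{\Sigma}^{-1}\Big(\frac{\partial\boldsymbol{\Sigma}}{\partial\sigma_i^2}\Big)\boldsymbol{\Sigma}^{-1}\mathbf{X}(\mathbf{X}'\boldsymbol{\Sigma}^{-1}\mathbf{X})^{-}\mathbf{c} \\
&= \mathbf{c}'(\mathbf{X}'\boldsymbol{\Sigma}^{-1}\mathbf{X})^{-}\mathbf{X}'\boldsymbol{\Sigma}^{-1}\mathbf{Z}_i\mathbf{Z}_i'\boldsymbol{\Sigma}^{-1}\mathbf{X}(\mathbf{X}'\boldsymbol{\Sigma}^{-1}\mathbf{X})^{-}\mathbf{c}.
\end{aligned}$$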




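17.16 (a)
$$\begin{aligned}
(\mathbf{L}\hat{\boldsymbol{\beta}} - \mathbf{L}\boldsymbol{\beta})'[\mathbf{L}(\mathbf{X}'\hat{\boldsymbol{\Sigma}}^{-1}\mathbf{X})^{-}\mathbf{L}']^{-1}(\mathbf{L}\hat{\boldsymbol{\beta}} - \mathbf{L}\boldsymbol{\beta})
&= (\mathbf{L}\hat{\boldsymbol{\beta}} - \mathbf{L}\boldsymbol{\beta})'\mathbf{P}\mathbf{D}^{-1}\mathbf{P}'(\mathbf{L}\hat{\boldsymbol{\beta}} - \mathbf{L}\boldsymbol{\beta}) \\
&= (\mathbf{L}\hat{\boldsymbol{\beta}} - \mathbf{L}\boldsymbol{\beta})'\Big(\sum_{i=1}^{g}\frac{\mathbf{p}_i\mathbf{p}_i'}{\lambda_i}\Big)(\mathbf{L}\hat{\boldsymbol{\beta}} - \mathbf{L}\boldsymbol{\beta}) \\
&= \sum_{i=1}^{g}\frac{[\mathbf{p}_i'(\mathbf{L}\hat{\boldsymbol{\beta}} - \mathbf{L}\boldsymbol{\beta})]^2}{\lambda_i}.
\end{aligned}$$

(b) Note that
$$\begin{aligned}
\mathrm{var}[\mathbf{p}_i'(\mathbf{L}\hat{\boldsymbol{\beta}} - \mathbf{L}\boldsymbol{\beta})] = \mathrm{var}(\mathbf{p}_i'\mathbf{L}\hat{\boldsymbol{\beta}})
&= \mathbf{p}_i'\mathbf{L}(\mathbf{X}'\hat{\boldsymbol{\Sigma}}^{-1}\mathbf{X})^{-}\mathbf{L}'\mathbf{p}_i \\
&= [\mathbf{P}'\mathbf{L}(\mathbf{X}'\hat{\boldsymbol{\Sigma}}^{-1}\mathbf{X})^{-}\mathbf{L}'\mathbf{P}]_{ii} \\
&= [\mathbf{P}'\mathbf{P}\mathbf{D}\mathbf{P}'\mathbf{P}]_{ii} = \mathbf{D}_{ii} = \lambda_i.
\end{aligned}$$

(c) For $i \neq j$,
$$\mathrm{cov}(\mathbf{p}_i'\mathbf{L}\hat{\boldsymbol{\beta}}, \mathbf{p}_j'\mathbf{L}\hat{\boldsymbol{\beta}}) = \mathbf{p}_i'\mathbf{L}(\mathbf{X}'\hat{\boldsymbol{\Sigma}}^{-1}\mathbf{X})^{-}\mathbf{L}'\mathbf{p}_j = \mathbf{p}_i'\mathbf{P}\mathbf{D}\mathbf{P}'\mathbf{p}_j = \lambda_i\mathbf{p}_i'\mathbf{p}_i\mathbf{p}_i'\mathbf{p}_j = 0.$$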


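17.17 (a) We use Theorems 5.2a and 5.2e to obtain
$$\begin{aligned}
E[\mathbf{a} - \mathbf{B}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})]'[\mathbf{a} - \mathbf{B}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})]
&= E(\mathbf{a}'\mathbf{a}) - E[\mathbf{a}'\mathbf{B}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})] - E[(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'\mathbf{B}'\mathbf{a}] \\
&\quad + E[(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'\mathbf{B}'\mathbf{B}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})] \\
&= \mathrm{tr}(\mathbf{V}) - \mathrm{tr}[\mathbf{B}\,\mathrm{cov}(\mathbf{y}, \mathbf{a})] - \mathrm{tr}[\mathbf{B}'\mathrm{cov}(\mathbf{a}, \mathbf{y})] + \mathrm{tr}(\mathbf{B}'\mathbf{B}\boldsymbol{\Sigma}) \\
&= \mathrm{tr}(\mathbf{V}) - \mathrm{tr}(\mathbf{B}\mathbf{Z}\mathbf{V}') - \mathrm{tr}(\mathbf{B}'\mathbf{V}\mathbf{Z}') + \mathrm{tr}(\mathbf{B}\boldsymbol{\Sigma}\mathbf{B}') \\
&= \mathrm{tr}(\mathbf{V}) + \mathrm{tr}(-\mathbf{B}\mathbf{Z}\mathbf{V}' - \mathbf{V}\mathbf{Z}'\mathbf{B}' + \mathbf{B}\boldsymbol{\Sigma}\mathbf{B}') \\
&= \mathrm{tr}(\mathbf{V}) + \mathrm{tr}[(\mathbf{B} - \mathbf{V}\mathbf{Z}'\boldsymbol{\Sigma}^{-1})\boldsymbol{\Sigma}(\mathbf{B} - \mathbf{V}\mathbf{Z}'\boldsymbol{\Sigma}^{-1})' - \mathbf{V}\mathbf{Z}'\boldsymbol{\Sigma}^{-1}\mathbf{Z}\mathbf{V}'].
\end{aligned}$$

(b) Since $E(\mathbf{a}) = \mathbf{0}$, $E(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) = \mathbf{0}$, and $\mathrm{cov}(\mathbf{a}, \mathbf{y}) = \mathbf{V}\mathbf{Z}'$, we have
$$\begin{aligned}
E[\mathbf{a} - \mathbf{B}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})][\mathbf{a} - \mathbf{B}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})]'
&= E(\mathbf{a}\mathbf{a}') - E[\mathbf{a}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'\mathbf{B}'] - E[\mathbf{B}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})\mathbf{a}'] \\
&\quad + E[\mathbf{B}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'\mathbf{B}'] \\
&= \mathrm{cov}(\mathbf{a}) - [\mathrm{cov}(\mathbf{a}, \mathbf{y})]\mathbf{B}' - \mathbf{B}\,\mathrm{cov}(\mathbf{y}, \mathbf{a}) + \mathbf{B}\,\mathrm{cov}(\mathbf{y})\mathbf{B}' \\
&= \mathbf{V} - \mathbf{V}\mathbf{Z}'\mathbf{B}' - \mathbf{B}\mathbf{Z}\mathbf{V}' + \mathbf{B}\boldsymbol{\Sigma}\mathbf{B}' \\
&= \mathbf{V} + (\mathbf{B} - \mathbf{V}\mathbf{Z}'\boldsymbol{\Sigma}^{-1})\boldsymbol{\Sigma}(\mathbf{B} - \mathbf{V}\mathbf{Z}'\boldsymbol{\Sigma}^{-1})' - \mathbf{V}\mathbf{Z}'\boldsymbol{\Sigma}^{-1}\mathbf{Z}\mathbf{V}'.
\end{aligned}$$

The first and third terms do not involve $\mathbf{B}$, and the second term is
"minimized" by $\mathbf{B} = \mathbf{V}\mathbf{Z}'\boldsymbol{\Sigma}^{-1}$. By "minimize," we mean that any
other choice for $\mathbf{B}$ adds a positive definite matrix to the result. This
holds because $\boldsymbol{\Sigma}$ is positive definite.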
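The minimization claim in Problem 17.17(a) can be illustrated numerically. The sketch below assumes NumPy is available and, purely for illustration, takes $\mathrm{cov}(\mathbf{y}) = \boldsymbol{\Sigma} = \mathbf{Z}\mathbf{V}\mathbf{Z}' + 0.5\,\mathbf{I}$; the dimensions, random seed, and the function name `mse_trace` are hypothetical choices, not part of the problem.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 5, 2                                   # n observations, m random effects

Z = rng.normal(size=(n, m))
L = rng.normal(size=(m, m))
V = L @ L.T + np.eye(m)                       # cov(a), positive definite
Sigma = Z @ V @ Z.T + 0.5 * np.eye(n)         # assumed cov(y) = Z V Z' + R, R = 0.5 I


def mse_trace(B):
    """tr(V - V Z' B' - B Z V + B Sigma B'), the criterion from part (a)."""
    return np.trace(V - V @ Z.T @ B.T - B @ Z @ V + B @ Sigma @ B.T)


B_opt = V @ Z.T @ np.linalg.inv(Sigma)        # the minimizer named in the answer

# Any perturbation of B_opt adds tr(Delta Sigma Delta') > 0 to the criterion.
perturbed = [mse_trace(B_opt + 0.1 * rng.normal(size=(m, n))) for _ in range(5)]
is_minimum = all(mse_trace(B_opt) < value for value in perturbed)
print(is_minimum)
```

A handful of random perturbations is only a sanity check, not a proof; the algebra in part (a) is what establishes the result.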


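17.18
$$\begin{aligned}
&[\mathbf{I} - \mathbf{X}(\mathbf{X}'\boldsymbol{\Sigma}^{-1}\mathbf{X})^{-}\mathbf{X}'\boldsymbol{\Sigma}^{-1}]\boldsymbol{\Sigma}[\mathbf{I} - \mathbf{X}(\mathbf{X}'\boldsymbol{\Sigma}^{-1}\mathbf{X})^{-}\mathbf{X}'\boldsymbol{\Sigma}^{-1}]' \\
&\quad= [\boldsymbol{\Sigma} - \mathbf{X}(\mathbf{X}'\boldsymbol{\Sigma}^{-1}\mathbf{X})^{-}\mathbf{X}'][\mathbf{I} - \boldsymbol{\Sigma}^{-1}\mathbf{X}(\mathbf{X}'\boldsymbol{\Sigma}^{-1}\mathbf{X})^{-}\mathbf{X}'] \\
&\quad= \boldsymbol{\Sigma} - \mathbf{X}(\mathbf{X}'\boldsymbol{\Sigma}^{-1}\mathbf{X})^{-}\mathbf{X}' - \mathbf{X}(\mathbf{X}'\boldsymbol{\Sigma}^{-1}\mathbf{X})^{-}\mathbf{X}'
 + \mathbf{X}(\mathbf{X}'\boldsymbol{\Sigma}^{-1}\mathbf{X})^{-}\mathbf{X}'\boldsymbol{\Sigma}^{-1}\mathbf{X}(\mathbf{X}'\boldsymbol{\Sigma}^{-1}\mathbf{X})^{-}\mathbf{X}' \\
&\quad= \boldsymbol{\Sigma} - \mathbf{X}(\mathbf{X}'\boldsymbol{\Sigma}^{-1}\mathbf{X})^{-}\mathbf{X}'.
\end{aligned}$$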
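The matrix identity in Problem 17.18 can be checked on a small example. This is a minimal sketch assuming NumPy; the matrices are randomly generated, $\mathbf{X}$ is made rank deficient on purpose, and the Moore-Penrose inverse from `np.linalg.pinv` is used as one particular (symmetric) choice of generalized inverse.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 6, 3

X = rng.normal(size=(n, p))
X[:, 2] = X[:, 0] + X[:, 1]                   # rank-deficient design: g-inverse needed

A = rng.normal(size=(n, n))
Sigma = A @ A.T + n * np.eye(n)               # positive definite covariance matrix
Sigma_inv = np.linalg.inv(Sigma)

# Moore-Penrose inverse as a generalized inverse of X' Sigma^{-1} X;
# rcond is set explicitly so the numerically zero singular value is dropped.
G = np.linalg.pinv(X.T @ Sigma_inv @ X, rcond=1e-10)

M = np.eye(n) - X @ G @ X.T @ Sigma_inv       # I - X (X' Sigma^{-1} X)^- X' Sigma^{-1}
lhs = M @ Sigma @ M.T
rhs = Sigma - X @ G @ X.T

identity_holds = np.allclose(lhs, rhs)
print(identity_holds)
```

The key step in the algebra, $\mathbf{X}(\mathbf{X}'\boldsymbol{\Sigma}^{-1}\mathbf{X})^{-}\mathbf{X}'\boldsymbol{\Sigma}^{-1}\mathbf{X} = \mathbf{X}$, is what lets the last two terms collapse, and it holds for any generalized inverse.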


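17.19 (a) Using Problem 17.17, the BLP of $\mathbf{U}\mathbf{a}$ is $E(\mathbf{U}\mathbf{a}|\mathbf{y}) = \mathbf{U}E(\mathbf{a}|\mathbf{y}) = \mathbf{U}\mathbf{G}\mathbf{Z}'\boldsymbol{\Sigma}^{-1}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$.
(b) $\mathrm{cov}[\mathbf{U}\mathbf{G}\mathbf{Z}'\boldsymbol{\Sigma}^{-1}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})] = \mathbf{U}\mathbf{G}\mathbf{Z}'\boldsymbol{\Sigma}^{-1}\boldsymbol{\Sigma}\boldsymbol{\Sigma}^{-1}\mathbf{Z}\mathbf{G}\mathbf{U}' = \mathbf{U}\mathbf{G}\mathbf{Z}'\boldsymbol{\Sigma}^{-1}\mathbf{Z}\mathbf{G}\mathbf{U}'$.
(c)
$$\begin{aligned}
\mathrm{cov}[\mathbf{U}\mathbf{G}\mathbf{Z}'\boldsymbol{\Sigma}^{-1}(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})]
&= \mathrm{cov}\{\mathbf{U}\mathbf{G}\mathbf{Z}'\boldsymbol{\Sigma}^{-1}[\mathbf{I} - \mathbf{X}(\mathbf{X}'\boldsymbol{\Sigma}^{-1}\mathbf{X})^{-}\mathbf{X}'\boldsymbol{\Sigma}^{-1}]\mathbf{y}\} \\
&= \mathbf{U}\mathbf{G}\mathbf{Z}'\boldsymbol{\Sigma}^{-1}[\mathbf{I} - \mathbf{X}(\mathbf{X}'\boldsymbol{\Sigma}^{-1}\mathbf{X})^{-}\mathbf{X}'\boldsymbol{\Sigma}^{-1}]
 \boldsymbol{\Sigma}[\mathbf{I} - \mathbf{X}(\mathbf{X}'\boldsymbol{\Sigma}^{-1}\mathbf{X})^{-}\mathbf{X}'\boldsymbol{\Sigma}^{-1}]'\boldsymbol{\Sigma}^{-1}\mathbf{Z}\mathbf{G}\mathbf{U}' \\
&= \mathbf{U}\mathbf{G}\mathbf{Z}'\boldsymbol{\Sigma}^{-1}[\boldsymbol{\Sigma} - \mathbf{X}(\mathbf{X}'\boldsymbol{\Sigma}^{-1}\mathbf{X})^{-}\mathbf{X}']\boldsymbol{\Sigma}^{-1}\mathbf{Z}\mathbf{G}\mathbf{U}'
 \quad [\text{using Problem 17.18}] \\
&= \mathbf{U}\mathbf{G}\mathbf{Z}'[\boldsymbol{\Sigma}^{-1} - \boldsymbol{\Sigma}^{-1}\mathbf{X}(\mathbf{X}'\boldsymbol{\Sigma}^{-1}\mathbf{X})^{-}\mathbf{X}'\boldsymbol{\Sigma}^{-1}]\mathbf{Z}\mathbf{G}\mathbf{U}'.
\end{aligned}$$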




17.20) 21 =(C I+ AZZ)! 


714+ of, oO a O ic 
O Ply+oijaiy O 
O O Oly + ot jais 
(Plt oF jai) | fe) és oO 
O (Plt ofjai) | O 
O O (P+ 074i) | 
[by(2.52)] 
oe 
pe Ns 
(gg) 0 . 
a 
| 8 Meatgh) = 0 
1 o 
een ee Se 
: ee) 
[by(2.53)] 
17.21 
cov[EBLUP(a)| 


~ 62 ' — 3 'xx'd xy x’d|]zG 


ji, 0 0 ju 9 0 
A . eat | acl, . aa —le A= 1 . 
= 64 0 j, OF -2 joGo® ji) hie ]} 0 j, 0 
0 0 Fi, 00 ix 
i 0 0 j 0 0 
a4 rs ! ’ ool 12 ‘ * 
= 0; 0 Ja 0 -a a IJ 0 Ja 0 
! U </ a +46; * 
0 0 j, 0 0 jy 
8 -4 -4 
a4 
Zt g —4 


3(6° + 467) 




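17.22 $\mathrm{cov}(\mathbf{a}, \mathbf{y}) = \mathrm{cov}(\mathbf{a}, \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{a} + \boldsymbol{\varepsilon}) = \mathrm{cov}(\mathbf{a}, \mathbf{Z}\mathbf{a}) = \mathbf{G}\mathbf{Z}'$. Hence, $\mathrm{cov}(\mathbf{a}|\mathbf{y}) = \boldsymbol{\Sigma}_{aa} - \boldsymbol{\Sigma}_{ay}\boldsymbol{\Sigma}_{yy}^{-1}\boldsymbol{\Sigma}_{ya} = \mathbf{G} - \mathbf{G}\mathbf{Z}'\boldsymbol{\Sigma}^{-1}\mathbf{Z}\mathbf{G}$.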

17.23 cov(aly) = 039 — 3Z/E'Z) = OFT — Zslio. Note that the off- 
diagonal elements are 0’s. 


17.24 Since ~'/? is symmetric, let ¥~!/7= .) . Then 


Io — Hy = Ly — 3 ?XX'S IX) IX’? 


a a 
a a i 0 

= eee maz aass: a a 
ate Oe eee ae ae ete Sd ae 
a —a ¢ 
a —a 


_y,_ (ldo 0 
joes O 0.1910 


The off-diagonal elements are either 0 or -0.1 (corresponding to correlations
of either 0 or -0.11).


Chapter 18 


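18.1 $E(y_i) = (0)P(y_i = 0) + (1)P(y_i = 1) = (1)p_i = p_i$,
$$\begin{aligned}
\mathrm{var}(y_i) &= E[y_i - E(y_i)]^2 \\
&= (0 - p_i)^2 P(y_i = 0) + (1 - p_i)^2 P(y_i = 1) \\
&= p_i^2(1 - p_i) + (1 - p_i)^2 p_i \\
&= p_i(1 - p_i)[p_i + (1 - p_i)] = p_i(1 - p_i).
\end{aligned}$$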


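18.2 Let $\theta_i = \beta_0 + \beta_1 x_i$. Then (18.7) becomes $p_i = e^{\theta_i}/(1 + e^{\theta_i})$. From this we obtain
$$1 - p_i = 1 - \frac{e^{\theta_i}}{1 + e^{\theta_i}} = \frac{1 + e^{\theta_i} - e^{\theta_i}}{1 + e^{\theta_i}} = \frac{1}{1 + e^{\theta_i}},$$
$$\frac{p_i}{1 - p_i} = \frac{e^{\theta_i}}{1 + e^{\theta_i}} \Big/ \frac{1}{1 + e^{\theta_i}} = \frac{e^{\theta_i}(1 + e^{\theta_i})}{1 + e^{\theta_i}} = e^{\theta_i},$$
$$\ln\Big(\frac{p_i}{1 - p_i}\Big) = \theta_i.$$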

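18.3
$$\begin{aligned}
\ln L(\beta_0, \beta_1) &= \ln \prod_{i=1}^{n} p_i^{y_i}(1 - p_i)^{1 - y_i} \\
&= \sum_{i=1}^{n}[y_i \ln p_i + (1 - y_i)\ln(1 - p_i)] \\
&= \sum_i y_i[\ln p_i - \ln(1 - p_i)] + \sum_i \ln(1 - p_i) \\
&= \sum_i y_i \ln\Big(\frac{p_i}{1 - p_i}\Big) + \sum_i \ln(1 - p_i).
\end{aligned}$$

By Problem 18.2, this becomes
$$\ln L(\beta_0, \beta_1) = \sum_i y_i(\beta_0 + \beta_1 x_i) + \sum_i \ln(1 - p_i).$$

To show that $\ln(1 - p_i) = -\ln(1 + e^{\beta_0 + \beta_1 x_i})$, let $\theta_i = \ln[p_i/(1 - p_i)]$. Then
$$e^{\theta_i} = \frac{p_i}{1 - p_i}.$$

Solve this to obtain $p_i = e^{\theta_i}/(1 + e^{\theta_i})$. Then show that $1 - p_i = 1/(1 + e^{\theta_i})$
and that $\ln(1 - p_i) = -\ln(1 + e^{\theta_i}) = -\ln(1 + e^{\beta_0 + \beta_1 x_i})$.

18.4
$$\frac{\partial \ln L}{\partial \beta_0} = \sum_{i=1}^{n} y_i - \sum_{i=1}^{n}\frac{e^{\beta_0 + \beta_1 x_i}}{1 + e^{\beta_0 + \beta_1 x_i}},$$
$$\frac{\partial \ln L}{\partial \beta_1} = \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n}\frac{x_i e^{\beta_0 + \beta_1 x_i}}{1 + e^{\beta_0 + \beta_1 x_i}}.$$

18.5 $b(\theta_i) = n_i \ln(1 - p_i) = -n_i \ln(1 + e^{\theta_i})$, as shown in the answer to Problem 18.3.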
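The log likelihood of Problem 18.3 and its partial derivatives from Problem 18.4 can be compared against finite differences. The sketch below is illustrative only: the data, the evaluation point $(\beta_0, \beta_1)$, and the function names `log_lik` and `score` are made-up assumptions, and the derivative formulas are the standard ones for simple logistic regression.

```python
import math


def log_lik(b0, b1, x, y):
    """ln L = sum y_i (b0 + b1 x_i) - sum ln(1 + exp(b0 + b1 x_i))."""
    return sum(yi * (b0 + b1 * xi) - math.log(1.0 + math.exp(b0 + b1 * xi))
               for xi, yi in zip(x, y))


def score(b0, b1, x, y):
    """Partial derivatives of ln L with respect to b0 and b1."""
    p = [math.exp(b0 + b1 * xi) / (1.0 + math.exp(b0 + b1 * xi)) for xi in x]
    d0 = sum(yi - pi for yi, pi in zip(y, p))
    d1 = sum(xi * (yi - pi) for xi, yi, pi in zip(x, y, p))
    return d0, d1


# Hypothetical data and evaluation point.
x = [0.5, 1.0, 1.5, 2.0, 2.5]
y = [0, 0, 1, 0, 1]
b0, b1, h = -1.0, 0.8, 1e-6

d0, d1 = score(b0, b1, x, y)
num_d0 = (log_lik(b0 + h, b1, x, y) - log_lik(b0 - h, b1, x, y)) / (2 * h)
num_d1 = (log_lik(b0, b1 + h, x, y) - log_lik(b0, b1 - h, x, y)) / (2 * h)
print(d0, num_d0, d1, num_d1)
```

Setting both components of `score` to zero gives the likelihood equations, which in general have to be solved iteratively.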


References 


Agresti, A. (1984). Analysis of Ordinal Categorical Data. New York: Wiley. 

Agresti, A. (1990). Categorical Data Analysis. New York: Wiley. 

Anderson, E. B. (1991). The Statistical Analysis of Categorical Data (2nd ed.). New York: 
Springer-Verlag. 

Anderson, T. W. (1984). Introduction to Multivariate Statistical Analysis (2nd ed.). New York: 
Wiley. 

Andrews, D. F. (1974). A robust method for multiple linear regression. Technometrics 16, 
523-531. 

Andrews, D. F. and A. M. Herzberg (1985). Data. New York: Springer-Verlag. 

Bailey, B. J. R. (1977). Tables of the Bonferroni t-statistic. Journal of the American Statistical 
Association 72, 469-479. 

Bates, D. M. and D. G. Watts (1988). Nonlinear Regression and Its Applications. New York: 
Wiley. 

Beckman, R. J. and R. D. Cook (1983). Outliers (with comments). Technometrics 25, 
119-163. 

Belsley, D. A., E. Kuh, and R. E. Welsch (1980). Regression Diagnostics: Identifying Data and 
Sources of Collinearity. New York: Wiley. 


Benjamini, Y. and Y. Hochberg (1995). Controlling the false discovery rate: A practical and 
powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B: 
Methodological 57, 289-300. 

Benjamini, Y. and D. Yekutieli (2001). The control of the false discovery rate in multiple 
testing under dependency. The Annals of Statistics 29(4), 1165-1188. 

Benjamini, Y. and D. Yekutieli (2005). False discovery rate-adjusted multiple confidence inter- 
vals for selected parameters. Journal of the American Statistical Association 100(469), 
71-81. 

Bingham, C. and S. E. Feinberg (1982). Textbook analysis of covariance—is it correct? 
Biometrics 38, 747-753. 

Birch, J. B. (1980). Some convergence properties of iterated reweighted least squares in the 
location model. Communications in Statistics B9(4), 359-369. 


Bishop, Y., S. Fienberg, and P. Holland (1975). Discrete Multivariate Analysis: Theory and 
Practice. Cambridge, MA: Massachusetts Institute of Technology Press. 


Linear Models in Statistics, Second Edition, by Alvin C. Rencher and G. Bruce Schaalje 
Copyright © 2008 John Wiley & Sons, Inc. 


653 


654 REFERENCES 


Bloomfield, P. (2000). Fourier Analysis of Time Series: An Introduction. New York: Wiley. 

Bonferroni, C. E. (1936). Il calcolo delle assicurazioni su gruppi di teste. Studii in Onore del 
Profesor S. O. Carboni. Roma. 

Box, G. E. P. and P. V. Youle (1955). The exploration and exploitation of response surfaces: 
An example of the link between the fitted surface and the basic mechanism of the system. 
Biometrics 11, 287-323. 

Broadbent, K. L. (1993). A Comparison of Six Bonferroni Procedures. Master’s thesis, 
Department of Statistics, Brigham Young University. 

Brown, H. and R. Prescott (1999). Applied Mixed Models in Medicine. New York: Wiley & 
Sons. 

Brown, H. and R. Prescott (2006). Applied Mixed Models in Medicine (2nd ed.). Hoboken, NJ: 
Wiley. 

Bryce, G. R. (1975). The one-way model. The American Statistician 29, 69-70. 

Bryce, G. R. (1998). Personal communication. 

Bryce, G. R., M. W. Carter, and M. W. Reader (1976). Nonsingular and singular transform- 
ations in the fixed model. Annual Meeting of the American Statistical Association, 
Boston, Aug. 1976. 

Bryce, G. R., M. W. Carter, and D. T. Scott (1980a). Recovery of Estimability in Fixed Models 
with Missing Cells. Technical Report SD-022-R, Department of Statistics, Brigham Young 
University. 

Bryce, G. R., D. T. Scott, and M. W. Carter (1980b). Estimation and hypothesis testing in linear 
models—a reparameterization approach. Communications in Statistics—Series A, Theory 
and Methods 9, 131-150. 

Casella, G. and E. I. George (1992). Explaining the Gibbs sampler. The American Statistician 
46, 167-174. 

Chatterjee, S. and A. S. Hadi (1988). Sensitivity Analysis in Linear Regression. New York: 
Wiley. 

Christensen, R. (1996). Plane Answers to Complex Questions: The Theory of Linear Models 
(2nd ed.). New York: Springer-Verlag. 

Christensen, R. (1997). Log-Linear Models and Logistic Regression (2nd ed.). New York: 
Springer-Verlag. 

Cochran, W. G. (1934). The distribution of quadratic forms in a normal system with appli- 
cations to the analysis of variance. Proceedings, Cambridge Philosophical Society 30, 
178-191. 

Cochran, W. G. (1977). Sampling Techniques. New York: Wiley. 

Cook, R. D. (1977). Detection of influential observations in linear regression. Technometrics 
19, 15-18. 

Cook, R. D. and S. Weisberg (1982). Residuals and Influence in Regression. New York: 
Chapman & Hall. 

Crampton, E. W. and J. W. Hopkins (1934). The use of the method of partial regression in the 
analysis of comparative feeding trial data. Part II. J. Nutrition 8, 329-339. 

Daniel, W. W. (1974). Biostatistics: A Foundation for Analysis in the Health Sciences. 
New York: Wiley. 

Devlin, S. J., R. Gnanadesikan, and J. R. Kettenring (1975). Robust estimation and outlier 
detection with correlation coefficients. Biometrika 62, 531-546. 




Diggle, P., P. Heagerty, K.-Y. Liang, and S. L. Zeger (2002). Analysis of Longitudinal Data. 
Oxford University Press. 

Dobson, A. J. (1990). An Introduction to Generalized Linear Models. New York: Chapman & 
Hall. 

Draper, N. R. and H. Smith (1981). Applied Regression Analysis (2nd ed.). New York: Wiley. 

Draper, N. R. and H. Smith (1998). Applied Regression Analysis. New York: Wiley. 

Driscoll, M. F. (1999). An improved result relating quadratic forms and chi-square distri- 
butions. The American Statistician 53, 273-275. 

Eubank, R. L. and R. L. Eubank (1999). Nonparametric Regression and Spline Smoothing. 
New York: Marcel Dekker. 

Evans, M. and T. Swartz (2000). Approximating Integrals via Monte Carlo and Deterministic 
Methods. Oxford University Press. 

Ezekiel, M. (1930). Methods of Correlation Analysis. New York: Wiley. 

Fai, A. H.-T. and P. L. Cornelius (1996). Approximate F-tests of multiple degree of freedom 
hypotheses in generalized least squares analyses of unbalanced split-plot experiments. 
Journal of Statistical Computation and Simulation 54, 363-378. 

Fisher, R. A. (1921). On the probable error of a coefficient of correlation deduced from a small 
sample. Metron 1, 1-32. 

Fitzmaurice, G. M., N. M. Laird, and J. H. Ware (2004). Applied Longitudinal Analysis. 
Hoboken, NJ: Wiley. 

Flury, B. W. (1989). Understanding partial statistics and redundancy of variables in regression 
and discriminant analysis. The American Statistician 43(1), 27-31. 

Fox, J. (1997). Applied Regression Analysis, Linear Models, and Related Methods. Thousand
Oaks, CA: SAGE Publications.

Freund, R. J. and P. D. Minton (1979). Regression Methods: A Tool for Data Analysis. 
New York: Marcel Dekker. 

Fuller, W. A. and G. E. Battese (1973). Transformations for estimation of linear models with 
nested-error structure. Journal of the American Statistical Association 68, 626-632. 

Gallant, A. R. (1975). Nonlinear regression. The American Statistician 29, 73-81. 

Gelman, A., J. B. Carlin, H. S. Stern, and D. B. Rubin (2004). Bayesian Data Analysis (2nd 
ed.). Chapman & Hall/CRC. 

Ghosh, B. K. (1973). Some monotonicity theorems for chi-square, F and t distributions with
applications. Journal of the Royal Statistical Society 35, 480-492.

Giesbrecht, F. G. and J. C. Burns (1985). Two-stage analysis based on a mixed model: Large- 
sample asymptotic theory and small-sample simulation results. Biometrics 41, 477-486. 

Gilks, W. R. E., S. E. Richardson, and D. J. E. Spiegelhalter (1998). Markov Chain Monte 
Carlo in Practice. London: Chapman & Hall. 

Gomez, E., G. Schaalje, and G. Fellingham (2005). Performance of the Kenward-Roger method
when the covariance structure is selected using AIC and BIC. Communications in Statistics:
Simulation and Computation 34(2), 377-392.

Graybill, F. A. (1954). On quadratic estimates of variance components. Annals of Mathematical 
Statistics 25(2), 367-372. 

Graybill, F. A. (1969). Introduction to Matrices with Applications in Statistics. Belmont, CA: 
Wadsworth Publishing Company. 




Graybill, F. A. (1976). Theory and Application of the Linear Model. North Scituate, MA: 
Duxbury Press. 


Graybill, F. A. and H. K. Iyer (1994). Regression Analysis: Concepts and Applications. North 
Scituate, MA: Duxbury Press. 


Graybill, F. A. and A. W. Wortham (1956). A note on uniformly best, unbiased estimators for 
variance components. Journal of the American Statistical Association 51, 266-268. 


Gutsell, J. S. (1951). The effect of sulfamerazine on the erythrocyte and hemoglobin content of 
trout blood. Biometrics 7(2), 171-179. 


Guttman, I. (1982). Linear Models: An Introduction. New York: Wiley. 
Hald, A. (1952). Statistical Theory with Engineering Applications. New York: Wiley. 


Hamilton, D. (1987). Sometimes $R^2 > r^2_{yx_1} + r^2_{yx_2}$: Correlated variables are not always
redundant. The American Statistician 41(2), 129-132.


Hampel, F. R. (1974). The influence curve and its role in robust estimation. Journal of the
American Statistical Association 69, 383-393.


Hartley, H. O. (1956). Programming analysis of variance for general purpose computers. 
Biometrics 12, 110-122. 


Harville, D. A. (1997). Matrix Algebra from a Statistician’s Perspective. New York: Springer- 
Verlag. 


Healy, M. J. R. and M. Westmacott (1969). Missing values in experiments analysed on auto- 
matic computers. Applied Statistics 5, 203-206. 


Helland, I. S. (1987). On the interpretation and use of $R^2$ in regression analysis. Biometrics 43,
61-69.


Henderson, C. R. (1950). Estimation of genetic parameters. Annals of Mathematical Statistics
21, 309-310.


Henderson, C. R. and A. J. McAllister (1978). The missing subclass problem in two-way fixed 
models. Journal of Animal Science 46, 1125-1137. 


Hendrix, L. J. (1967, Aug.). Auditory Discrimination Differences between Culturally Deprived 
and Nondeprived Preschool Children. PhD thesis, Brigham Young University. 


Hendrix, L. J., M. W. Carter, and J. Hintze (1978). A comparison of five statistical methods for 
analyzing pretest-post designs. Journal of Experimental Education 47, 96-102. 


Hendrix, L. J., M. W. Carter, and D. T. Scott (1982). Covariance analysis with heterogeneity of 
slopes in fixed models. Biometrics 38, 641-650. 


Hilbe, J. M. (1994). Generalized linear models. The American Statistician 48, 255-265. 


Hoaglin, D. C. and R. E. Welsch (1978). The hat matrix in regression and ANOVA. The 
American Statistician 32, 17-22. 


Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple tests of significance.
Biometrika 75, 800-802.


Hocking, R. R. (1976). The analysis and selection of variables in linear regression. Biometrics 
32, 1-51. 

Hocking, R. R. (1985). The Analysis of Linear Models. Monterey, CA: Brooks/Cole. 

Hocking, R. R. (1996). Methods and Applications of Linear Models. New York: Wiley. 


Hocking, R. R. (2003). Methods and Applications of Linear Models (2nd ed.). New York: 
Wiley. 




Hocking, R. R. and F. M. Speed (1975). A full rank analysis of some linear model problems. 
Journal of the American Statistical Association 70, 706-712. 

Hoerl, A. E. and R. W. Kennard (1970). Ridge regression: Biased estimation for nonorthogonal 
problems. Technometrics 12, 55-67. 

Hogg, R. V. and A. T. Craig (1995). Introduction to Mathematical Statistics (5th ed.). 
Englewood Cliffs, NJ: Prentice-Hall. 

Holland, B. (1991). On the application of three modified Bonferroni procedures to pairwise 
multiple comparisons in balanced repeated measures designs. Computational Statistics 
Quarterly 3, 219-231. 

Holland, B. and M. D. Copenhaver (1987). An improved sequentially rejective Bonferroni test 
procedure. Biometrics 43, 417-423. 

Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal 
of Statistics 6, 65-70. 

Hommel, G. (1988). A stagewise rejective multiple test procedure based on a modified 
Bonferroni test. Biometrika 75, 383-386. 

Hosmer, D., B. Jovanovic, and S. Lemeshow (1989). Best subsets logistic regression. 
Biometrics 45, 1265-1270. 

Hosmer, D. W. and S. Lemeshow (1989). Applied Logistic Regression. New York: Wiley. 

Huber, P. J. (1973). Robust regression: Asymptotics, conjectures, and Monte Carlo. Annals of 
Statistics 1, 799-821. 

Hummel, T. J. and J. Sligo (1971). Empirical comparison of univariate and multivariate analy- 
sis of variance procedures. Psychological Bulletin 76, 49-57. 

Jammalamadaka, S. R. and D. Sengupta (2003). Linear Models: An Integrated Approach.
Singapore: World Scientific Publications.

Jeske, D. R. and D. A. Harville (1988). Prediction-interval procedures and (fixed-effects) 
confidence-interval procedures for mixed linear models. Communications in Statistics: 
Theory and Methods 17, 1053-1087. 

Jørgensen, B. (1993). The Theory of Linear Models. New York: Chapman & Hall.

Kackar, R. N. and D. A. Harville (1984). Approximations for standard errors of estimators of 
fixed and random effects in mixed linear models. Journal of the American Statistical 
Association 79, 853-862. 

Kendall, M. G. and A. Stuart (1969). The Advanced Theory of Statistics (3rd ed.), Vol. 1. 
New York: Hafner. 

Kenward, M. G. and J. H. Roger (1997). Small sample inference for fixed effects from 
restricted maximum likelihood. Biometrics 53, 983-997. 

Keselman, H. J., R. K. Kowalchuk, J. Algina, and R. D. Wolfinger (1999). The analysis of 
repeated measurements: A comparison of mixed-model Satterthwaite F tests and a non- 
pooled adjusted degrees of freedom multivariate test. Communications in Statistics: 
Theory and Methods 28, 2967-2999. 

Kleinbaum, D. G. (1994). Logistic Regression. New York: Springer-Verlag. 

Krasker, W. S. and R. Welsch (1982). Efficient bounded-influence regression estimation. 
Journal of the American Statistical Association 77, 595-604. 

Kshirsagar, A. M. (1983). A Course in Linear Models. New York: Marcel Dekker. 

Ku, H. H. and S. Kullback (1974). Loglinear models in contingency table analysis. The 
American Statistician 28, 115-122. 




Kutner, M. H., C. J. Nachtsheim, J. Neter, and W. Li (2005). Applied Linear Statistical Models
(5th ed.). New York: McGraw-Hill/Irwin.

Lehmann, E. L. (1999). Elements of Large-Sample Theory. New York: Springer-Verlag. 

Lindley, D. V. and A. F. M. Smith (1972). Bayes estimates for the linear model (with discus- 
sion). Journal of the Royal Statistical Society, Series B: Methodological 34, 1-41. 

Lindsey, J. K. (1997). Applying Generalized Linear Models. New York: Springer-Verlag. 

Little, R. J. A. and D. B. Rubin (2002). Statistical Analysis with Missing Data. Hoboken, NJ: 
Wiley. 

Mahalanobis, P. C. (1936). On the generalized distance in statistics. Proceedings of the
National Institute of Science of India 12, 49-55.

Mahalanobis, P. C. (1964). Professor Ronald Aylmer Fisher. Biometrics 20, 238-250.

Marcuse, S. (1949). Optimum allocation and variance components in nested sampling with an
application to chemical analysis. Biometrics 5(3), 189-206.

McCullagh, P. and J. A. Nelder (1989). Generalized Linear Models (2nd ed.). New York:
Chapman & Hall.

McCulloch, C. E. and S. R. Searle (2001). Generalized, Linear, and Mixed Models. New York:
Wiley.

McLean, R. A., W. L. Sanders, and W. W. Stroup (1991). A unified approach to mixed linear
models. American Statistician 45, 54-64.

Mendenhall, W. and T. Sincich (1996). A Second Course in Statistics: Regression Analysis.
Englewood Cliffs, NJ: Prentice-Hall.

Milliken, G. A. and D. E. Johnson (1984). Analysis of Messy Data, Vol. 1: Designed
Experiments. New York: Van Nostrand-Reinhold.

Montgomery, D. C. and E. A. Peck (1992). Introduction to Linear Regression Analysis (2nd
ed.). New York: Wiley.

Morrison, D. F. (1983). Applied Linear Statistical Methods. Englewood Cliffs, NJ: Prentice-
Hall.

Mosteller, F. and J. W. Tukey (1977). Data Analysis and Regression. Reading, MA: Addison-
Wesley.

Muller, K. E. and M. C. Mok (1997). The distribution of Cook’s D statistic. Communications in
Statistics: Theory and Methods 26, 525-546.

Myers, R. H. (1990). Classical and Modern Regression with Applications (2nd ed.). Boston:
Duxbury Press.

Myers, R. H. and J. S. Milton (1991). A First Course in the Theory of Linear Statistical Models.
Boston: PWS-Kent.

Nelder, J. A. (1974). Letter to the editor. Journal of the Royal Statistical Society, Series C 23,
232.

Nelder, J. A. and P. W. Lane (1995). The computer analysis of factorial experiments: In mem- 

oriam—Frank Yates. The American Statistician 49, 382—385. 

Nelder, J. A. and R. W. M. Wedderburn (1972). Generalized linear models. Journal of the 
Royal Statistical Society, Series A 135, 370-384. 

Ogden, R. T. (1997). Essential Wavelets for Statistical Applications and Data Analysis.
Birkhäuser.




Ostle, B. and L. C. Malone (1988). Statistics in Research: Basic Concepts and Techniques for 
Research Workers (4th ed.). Ames: Iowa State University Press. 

Ostle, B. and R. W. Mensing (1975). Statistics in Research (3rd ed.). Ames: Iowa State 
University Press. 

Patterson, H. D. and R. Thompson (1971). Recovery of inter-block information when block 
sizes are unequal. Biometrika 58, 545-554. 

Pawitan, Y. (2001). In All Likelihood: Statistical Modelling and Inference Using Likelihood. 
Oxford University Press. 

Pearson, E. S., R. L. Plackett, and G. A. Barnard (1990). Student: A Statistical Biography of
William Sealy Gosset. New York: Oxford University Press.

Plackett, R. L. (1981). The Analysis of Categorical Data (2nd ed.). London: Griffin. 

Rao, C. R. (1965). Linear Statistical Inference and Its Applications. New York: Wiley. 

Rao, P. S. R. S. (1997). Variance Components Estimation. London: Chapman & Hall. 

Ratkowsky, D. A. (1983). Nonlinear Regression Modelling: A Unified Approach. New York: 
Marcel Dekker. 

Ratkowsky, D. A. (1990). Handbook of Nonlinear Regression Models. New York: Marcel 
Dekker. 

Read, T. R. C. and N. A. C. Cressie (1988). Goodness-of-Fit Statistics for Discrete Multivariate 
Data. New York: Springer-Verlag. 


Reader, M. W. (1973). The Analysis of Covariance with a Single Linear Covariate Having 
Heterogeneous Slopes. Master’s thesis, Department of Statistics, Brigham Young 
University. 


Rencher, A. C. (1993). The contribution of individual variables to Hotelling’s $T^2$, Wilks’ $\Lambda$ and
$R^2$. Biometrics 49, 217-225.


Rencher, A. C. (1995). Methods of Multivariate Analysis. New York: Wiley. 

Rencher, A. C. (1998). Multivariate Statistical Inference and Applications. New York: Wiley. 

Rencher, A. C. (2002). Multivariate Statistical Inference and Applications. Hoboken, NJ: 
Wiley. 

Rencher, A. C. and D. T. Scott (1990). Assessing the contribution of individual variables
following rejection of a multivariate hypothesis. Communications in Statistics, Series B,
Simulation and Computation 19, 535-553.


Rom, D. M. (1990). A sequentially rejective test procedure based on a modified Bonferroni 
inequality. Biometrika 77, 663-665. 


Ross, S. M. (2006). Introduction to Probability Models (9th ed.). San Diego, CA: Academic 
Press. 


Royston, J. P. (1983). Some techniques for assessing multivariate normality based on the 
Shapiro-Wilk W. Applied Statistics 32, 121-133. 


Ryan, T. P. (1997). Modern Regression Methods. New York: Wiley. 

Santner, T. J. and D. E. Duffy (1989). The Statistical Analysis of Discrete Data. New York: 
Springer-Verlag. 

Satterthwaite, F. E. (1941). Synthesis of variances. Psychometrika 6, 309-316. 


Saville, D. J. (1990). Multiple comparison procedures: The practical solution (C/R: 91V45
p165-168). The American Statistician 44, 174-180.




Schaalje, G. B., J. B. McBride, and G. W. Fellingham (2002). Adequacy of approximations to 
distributions of test statistics in complex mixed linear models. Journal of Agricultural, 
Biological, and Environmental Statistics 7(4), 512-524. 


Scheffé, H. (1953). A method of judging all contrasts in the analysis of variance. Biometrika 
40, 87-104. 


Scheffé, H. (1959). The Analysis of Variance. New York: Wiley. 

Schott, J. R. (1997). Matrix Analysis for Statistics. New York: Wiley. 

Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics 6, 461-464. 
Searle, S. R. (1971). Linear Models. New York: Wiley. 


Searle, S. R. (1977). Analysis of Variance of Unbalanced Data from 3-Way and Higher-Order 
Classifications. Technical Report BU-606-M, Cornell University, Biometrics Units. 


Searle, S. R. (1982). Matrix Algebra Useful for Statistics. New York: Wiley. 

Searle, S. R., G. Casella, and C. E. McCulloch (1992). Variance Components. New York: 
Wiley. 

Searle, S. R., F. M. Speed, and H. V. Henderson (1981). Some computational and model
equivalencies in analysis of variance of unequal-subclass-numbers data. The American
Statistician 35, 16-33.

Seber, G. A. F. (1977). Linear Regression Analysis. New York: Wiley. 

Seber, G. A. F. and A. J. Lee (2003). Linear Regression Analysis (2nd ed.). Hoboken, NJ: 
Wiley. 

Seber, G. A. F. and C. J. Wild (1989). Nonlinear Regression. New York: Wiley. 


Sen, A. and M. Srivastava (1990). Regression Analysis: Theory, Methods, and Applications.
New York: Springer-Verlag.


Shaffer, J. P. (1986). Modified sequentially rejective multiple test procedures. Journal of the
American Statistical Association 81, 826-831.


Silverman, B. W. (1999). Density Estimation for Statistics and Data Analysis. London: 
Chapman & Hall. 


Simes, R. J. (1986). An improved Bonferroni procedure for multiple tests of significance. 
Biometrika 73, 751-754. 


Snedecor, G. W. (1948). Answer to query. Biometrics 4(2), 132-134. 


Snedecor, G. W. and W. G. Cochran (1967). Statistical Methods (6th ed.). Ames: Iowa State 
University Press. 


Snee, R. D. (1977). Validation of regression models: Methods and examples. Technometrics 
19, 415-428. 


Speed, F. M. (1969). A New Approach to the Analysis of Linear Models. Technical report, 
National Aeronautics and Space Administration, Houston, TX; a NASA Technical 
memo, NASA TM X-58030. 


Speed, F. M., R. R. Hocking, and O. P. Hackney (1978). Methods of analysis of linear models 
with unbalanced data. Journal of the American Statistical Association 73, 105-112. 


Spiegelhalter, D. J., N. G. Best, B. P. Carlin, and A. van der Linde (2002). Bayesian measures 
of model complexity and fit. Journal of the Royal Statistical Society, 
Series B: Statistical Methodology 64(4), 583-616. 


Stapleton, J. H. (1995). Linear Statistical Models. New York: Wiley. 
Stigler, S. M. (2000). The problematic unity of biometrics. Biometrics 56(3), 653-658. 


Stokes, M. E., C. S. Davis, and G. G. Koch (1995). Categorical Data Analysis Using the SAS 
System. Cary, NC: SAS Institute. 

Theil, H. and C. Chung (1988). Information-theoretic measures of fit for univariate and multi- 
variate linear regressions. The American Statistician 42, 249-252. 

Tiku, M. L. (1967). Tables of the power of the F-test. Journal of the American Statistical 
Association 62, 525-539. 

Turner, D. L. (1990). An easy way to tell what you are testing in analysis of variance. 
Communications in Statistics—Series A, Theory and Methods 19, 4807-4832. 

Urquhart, N. S., D. L. Weeks, and C. R. Henderson (1973). Estimation associated with linear 
models: A revisitation. Communications in Statistics 1, 303-330. 

Verbeke, G. and G. Molenberghs (2000). Linear Mixed Models for Longitudinal Data. 
Springer-Verlag. 

Wald, A. (1943). Tests of statistical hypotheses concerning several parameters when the 
number of observations is large. Transactions of the American Mathematical Society 54, 
426-483. 

Wang, S. G. and S. C. Chow (1994). Advanced Linear Models: Theory and Applications. 
New York: Marcel Dekker. 

Weisberg, S. (1985). Applied Linear Regression. New York: Wiley. 

Welsch, R. E. (1975). Confidence regions for robust regression. Paper presented at Annual 
Meeting of the American Statistical Association, Washington, DC. 

Winer, B. J. (1971). Statistical Principles in Experimental Design (2nd ed.). New York: 
McGraw-Hill. 

Working, H. and H. Hotelling (1929). Application of the theory of error to the interpretation of 
trends. Journal of the American Statistical Association, Suppl. (Proceedings) 24, 73-85. 

Yates, F. (1934). The analysis of multiple classifications with unequal numbers in the different 
classes. Journal of the American Statistical Association 29, 52-66. 


Index 


Adjusted R², 162 
Alias matrix, 170 
Analysis of covariance, 443-478 
assumptions, 443-444 
covariates, 444 
estimation, 446-448 
model, 444-445 
one-way model with one covariate, 
449-451 
estimation of parameters, 449-450 
model, 449 
testing hypotheses, 448, 450-451 
equality of treatment effects, 
450-452 
homogeneity of slopes, 452-456 
interpretation, 456 
slope, 452 
one-way model with multiple covariates, 
464-472 
estimation of parameters, 465-468 
model, 464-465 
testing hypotheses, 468-469 
equality of treatment effects, 
468-469 
homogeneity of slope vectors, 
470-472 
slope vector, 470 
power, 444 
testing hypotheses, 448 
two-way model with one covariate, 
457-464 
model, 457 
testing hypotheses, 458-464 
homogeneity of slopes, 463-464 
main effects and interactions, 
458-462 
slope, 462 
unbalanced models, 473-474 
cell means model, 473 
constrained model, 473-474 


Analysis of variance, 295-338 
estimability of β in the empty cells 
model, 432, 434-435 
estimability of β in the non-full-rank 
model, 302-304 
estimable functions λ′β, 305-308 
conditions for estimability of λ′β, 
305-307 
estimators of λ′β, 309-313 
BLUE properties of, 313 
covariance of, 312 
variance of, 311 
estimation of σ² in the non-full-rank 
model, 313-314 
model, 3-4, 295-301 
one-way. See One-way model 
two-way. See Two-way model 
normal equations, 302-303 
solution using generalized inverse, 
302-303 
normal model, 314-316 
estimators of β and σ², 314-315 
properties of, 316 
and regression, 4 
reparameterization to full-rank model, 
318-320 
side conditions, 320-322, 433 
SSE in the non-full-rank model, 
313-314 
testable hypotheses, 323-324 
testable hypotheses in the empty cells 
model, 433 
testing hypotheses, 323-329 
full and reduced model, 324-326 
general linear hypothesis, 326-329 
treatments or natural groupings of units, 4 
unbalanced data. See Unbalanced data 
in ANOVA 
Angle between two vectors, 41-42, 136, 
163, 238 


Linear Models in Statistics, Second Edition, by Alvin C. Rencher and G. Bruce Schaalje 


Copyright © 2008 John Wiley & Sons, Inc. 




Asymptotic inference for large samples, 
260-262, 491, 515 
Augmented matrix, 29 


Bayes’ theorem, 278-279 

Bayesian linear model, 279-284, 480 

Bayesian linear mixed model, 497 

Best linear predictor, 499 

Best linear unbiased estimators (BLUE), 
147, 165, 313 

Best quadratic unbiased estimators, 151, 
486 

Beta weights, 251 

BIC. See Information criterion 

BLUE. See Best linear unbiased estimators 


Causality, 3, 130-131, 443 
Chi-square distribution, 112-114 
central chi-square, 112 
moment-generating function, 112-113 
noncentral chi-square, 112-114 
noncentrality parameter, 112, 124 
Cluster correlation, 479-480, 481-485 
Coefficient of determination 
in multiple regression, 161-164 
in simple linear regression, 133-134 
Coefficient(s), regression, 2, 127 
Conditional density, 73, 95-99, 
278-284, 498-499 
Confidence interval(s) 
for β₁ in simple linear regression, 133 
in Bayesian regression, 278, 285 
in linear mixed models, 491, 495 
in multiple regression. See Regression, 
multiple linear with fixed x’s, 
confidence interval(s) 
in random-x regression, 261-262 
Contrasts, 308, 341, 357-371 
Control of output, 3 
Correlation 
bivariate, 134 
Correlation matrix (matrices) 
population, 77-78 
relationship to covariance matrix, 77-78 
sample, 247 
relationship to covariance matrix, 
247-248 
Covariance matrix (matrices) 
for β̂, 145 
for partitioned random vector, 78 
population, 75-76 
sample, 156, 246-247 
for two random vectors, 82 


Data space, 153, 163, 316-317 
Dependent variable, 1, 137, 295 
Derivative, matrix and vector, 
56-59, 91, 109, 142, 158, 495 
Determinant, 37-41 
Determination, coefficient of. 
See Coefficient of determination 
Diagnostics, regression, 
227-238. See also Hat matrix; 
Influential observations; 
Outliers; Residual(s) 
Diagonal matrix, 8 
DIC. See Information criterion 
Distance 
Mahalanobis, 77 
standardized, 77 
Distribution(s) 
chi-square, 112-114 
F, 114-116 
gamma, 280 
inverse gamma, 284 
multivariate t, 282-283, 285 
normal. See Normal distribution 
t, 216, 283 


Effect of each variable on R², 
262-265 
Eigenvalues. See Matrix, eigenvalues 
Eigenvectors. See Matrix, eigenvectors 
Empty cells, 432-439 
Error sum of squares. See SSE 
Error term, 1, 137 
Estimated best linear unbiased predictor, 
499 
Estimated generalized least squares 
estimation, 490 
Exchangeability, 277 
Expected mean squares, 173-174, 179, 
182, 312-317, 362-367, 433 
Expected value 
of bilinear form [E(x’Ay)], 111 
of least squares estimators, 
131-132 
of quadratic form [E(y’Ay)], 107 
of R², 162 
of random matrix, 75-76 
of random variable [E(y)], 70 
of random vector [E(y)], 75-76 
of sample covariance [E(s_xy)], 112 
of sample variance [E(s²)], 108, 131, 150 
of sum of random variables, 70 
of sum of random vectors, 75-76 
Exponential family, 514 


F-Distribution, 114-116 
central F, 114 
mean of central F, 115 
noncentral F, 115 
noncentrality parameter, 115 
variance of central F, 115 
F-Tests. See also Regression, multiple 
linear with fixed x’s, tests of hypotheses; 
Tests of hypotheses 
general linear hypothesis test, 198-203 
for overall regression, 185 
power, 115 
subset of the β’s, 189 
False discovery rate, 206 
First order multivariate Taylor series, 495 
Fixed effects models, 480 


Gauss-Markov theorem, 146-147, 276. See 
also Best linear unbiased estimators 
Generalized least squares, 164-169, 
285-286, 479, 503 
Generalized linear models, 513-516 
exponential family, 514 
likelihood function, 512 
linear predictor, 513-514 
link function, 514 
model, 514 
Generalized inverse, 32-37, 302-303, 
343, 384 
of symmetric matrix, 33 
Generalized variance, 77, 88-89 
Geometry of least squares, 151-154, 163, 
316-317 
angle between two vectors, 163 
prediction space, 153-154, 163, 
316-317 
data space, 153, 163, 316-317 
parameter space, 152, 154, 316-317 
Gibbs sampling, 289, 291 


Hadamard product, 16, 425 

Hat matrix, 230-231 

Hessian matrix, 495 

Highest density interval, 279, 285 
Hyperprior distribution, 280, 287 
Hypothesis tests. See Tests of hypotheses 


Idempotent matrix 
for chi-square distribution, 117-118 
definition and properties, 54-55 
in linear mixed models, 487 
Identity matrix, 8 


Independence 
of contrasts, 358-362 
independence and zero covariance, 
93-94 
of linear functions and quadratic forms, 
119-120 
of quadratic forms, 120-121 
of random variables, 71, 94 
of random vectors, 93, 94 
of SSR and SSE, 187 
Influential observations, 235-238 
Cook’s distance, 236-237 
leverage, 236 
Information criterion, 286 
Iterative methods for finding estimates, 490 
Invariance 
of F, 149, 200 
of maximum likelihood estimators, 
247-248 
of R², 149 
of s², 149 
of t, 149 
of ŷ, 148-149 
Inverse matrix. See Matrix, inverse 


j vector, 8 
J matrix, 8 


Kenward-Roger adjustment, 496-497 


Lagrange multiplier, 60, 68, 179, 201, 220, 
223, 429 
Least squares, 128, 131, 141, 143, 
145-151, 302, 507 
properties of estimators, 129-133, 143, 
145-147 
Likelihood function, 158, 513-514 
Likelihood ratio tests, 258-262 
Linear estimator, 143. See also Best linear 
unbiased estimators 
Linear mixed model, 480 
randomized blocks, 481-482 
subsampling, 482 
split plot studies, 483-484, 492-494 
one-way random effects, 484, 489 
random coefficients, 484-485 
heterogeneous variances, 485-486 
Linear model, 2, 137 
Linear models, generalized. See 
Generalized linear models 
Logistic regression, 508-511 
binary y, 508 
estimation, 510 


logit transformation, 509 
model, 509-510 
polytomous model, 511 

categorical, 511 
ordinal, 511 
several x’s, 510 

Logit transformation, 509 

Loglinear models, 511-512 
contingency table, 511 
likelihood ratio test, 512 
maximum likelihood estimators, 512 

LSD test, 209 


Mahalanobis distance, 77 
Markov Chain Monte Carlo, 288-289, 
291-292 
Matrix (matrices), 5-68 
addition of, 9-10 
algebra of, 5-60 
augmented matrix, 29 
bilinear form, 16 
Cholesky decomposition, 27 
conditional inverse, 33 
conformable matrices, 9 
definition, 5 
derivatives, 56-58 
determinant, 37-41 
of partitioned matrix, 38-40 
diagonal of a matrix, 7 
diagonal matrix, 8 
diagonalizing a matrix, 52 
differentiation, 56-57 
eigenvalues, 46-53, 496 
characteristic equation, 47 
and determinant, 51-52 
of functions of a matrix, 49-50 
of positive definite matrix, 53 
square root matrix, 53 
of product, 50-53 
of symmetric matrix, 51 
and trace, 51 
eigenvectors, 46-47, 496 
equality, 6 


generalized inverse, 32-37, 302, 343, 
384, 391-395 

of symmetric matrix, 36 
Hadamard product, 16, 425 
idempotent matrix, 54 

and eigenvalues, 54 
identity matrix, 8 
inverse, 21-23 
conditional inverse, 33 
generalized inverse, 32-37 


of partitioned matrix, 23-24 
of product, 22 
j vector, 8 
J matrix, 8 
multiplication of, 10 
conformal matrices, 10 
nonsingular matrix, 21 
notation, 5 
O (zero matrix), 8 
orthogonal matrix, 41-43 
partitioned matrix, 16-18 
multiplication of, 17 
positive definite matrix, 24-28 
positive semidefinite matrix, 25-28 
product, 10 
commutativity, 10 
as linear combination of columns, 17 
matrix and diagonal matrix, 16 
matrix and j, 12 
matrix and scalar, 10 
product equal to zero, 20 
rank of product, 21 
quadratic form, 16. See also Quadratic 
form(s) 
random matrix, 69 
rank, 19-21. See also Rank of a matrix 
spectral decomposition, 51, 360, 362, 
495-496 
square root matrix, 53 
sum of, 9 
symmetric matrix, 7 
spectral decomposition, 51 
trace, 44-46 
transpose, 7 
of product, 13 
triangular matrix, 8 
vector(s). See Vector(s) 
zero matrix (O) and zero vector (0), 8 
Matrix product. See Matrix, product 
Maximum likelihood estimators 
for β and σ² in ANOVA, 315 
for β and σ² in fixed-x regression, 
158-159 
properties, 159-161 
for β₀, β₁, and σ² in random-x 
regression, 245-248 
properties, 248-249 
invariance of, 249 
in loglinear models, 511 
for partial correlation, 266-268 
MCMC. See Markov Chain Monte Carlo 
Mean. See also Expected value 
sample mean. See Sample mean 
population mean, 70 


Missing at random, 432 
Misspecification of cov(y), 167-169. See 
also Generalized least squares 
Misspecification of model, 169-174 
alias matrix, 170 
overfitting, 170-172 
underfitting, 170-172 
Model diagnostics, 227-238. See also Hat 
matrix; Influential observations; 
Outliers; Residual(s) 
Model, linear, 2, 137 
Model validation, 227-238. See also Hat 
matrix; Influential observations; 
Outliers; Residual(s) 
Moment-generating function, 90-92, 96, 
99-100, 103-104, 108 
Multiple linear regression, 90-92, 108, 
112-114, 117-119, 122. See 
Regression, multiple linear with 
fixed x’s 
Multivariate delta method, 495 
Multivariate normal distribution, 87-103 
conditional distribution, 95-97 
density function, 88-89 
independence and zero covariance, 93-94 
linear functions of, 89 
marginal distribution, 93 
moment-generating function of, 90-92 
partial correlation, 100-101 
properties of, 92-100 


Noncentrality parameter 

for chi-square, 112 

for F, 114, 187, 192, 325 

for t, 116, 132 
Nonlinear regression, 507 

confidence intervals, 507 

least squares estimators, 507 

tests of hypotheses, 507 
Nonsingular matrix, 21 
Normal distribution 

multivariate. See Multivariate normal 

distribution 
univariate, 87-88 
standard normal, 87 

Normalizing constant, 278, 281, 284 


O (zero matrix), 8 
One-way model (balanced), 3, 295-298, 
339-376 
contrasts, 357-371 
and eigenvectors, 360-362 
hypothesis test for, 344-351 


orthogonal contrasts, 358-371 
independence of, 363-364 
orthogonal polynomial contrasts, 
363-371 
partitioning of sum of squares, 
360-361 
estimable functions, 340-341 
contrasts, 341 
estimation of σ², 343-344 
expected mean squares, 351-357 
full-reduced-model method, 352-354 
general linear hypothesis method, 
354-356 
normal equations, 341-344 
solution using generalized inverse, 343 
solution using side conditions, 
342-343 
overparameterized model, 297 
assumptions, 297-298 
parameters not unique, 297 
reparameterization, 298 
side conditions, 298 
SSE, 314 
testing the hypothesis H₀: μ₁ = μ₂ = 
⋯ = μₖ, 344-351 
full and reduced model, 344-348 
general linear hypothesis, 348-351 
Orthogonal matrix, 41-43 
Orthogonal polynomials, 363-371 
Orthogonal vectors, 40 
Orthogonal x’s in regression models, 149, 
174-178 
Orthogonality of columns of X in balanced 
ANOVA models, 333-335 
Orthogonality of rows of A in unbalanced 
ANOVA models, 293-296 
Orthogonalizing the x’s in regression 
models, 174-178 
and partial regression coefficients, 
175-176 
Outliers, 232-235 
mean shift outlier model, 235 
PRESS (prediction sum of squares), 235 
Overfitting, 170-172 


p-Value 
for F-test, 188-189 
for t-test, 132 
Parameter space, 152, 154, 316-317 
Partial correlation(s), 100-101, 266-273 
matrix of (population) partial 
correlations, 100-101 
sample partial correlations, 177-178, 
266-273 


Partial interaction constraints, 434 
Poisson distribution, 512 
Poisson regression, 512-513 
likelihood function, 513 
model, 513 
Polynomials, orthogonal. See Orthogonal 
polynomials 
Positive definite matrix, 24-28 
Positive semidefinite matrix, 25-28 
Posterior distribution, 278-284 
conditional, 289 
marginal, 282 
Posterior predictive distribution, 279, 
290-292 
Prediction, 2-3, 137, 142, 148, 
156, 161 
Precision, 280 
Prediction of a random effect, 497-499 
Prediction interval, 213-215 
Prediction space, 153-154, 163, 
316-317 
Prediction sum of squares (PRESS), 235 
PRESS (prediction sum of squares), 235 
Prior distribution, 278-284 
diffuse, 281, 287 
informative, 281 
conjugate, 281, 289 
specification, 280 
Projection matrix, 228 


Quadratic form(s), 16, 489 
distribution of, 117-118 
expected value of, 107 
idempotent matrix, 106 
independence of, 119-121 
moment-generating function of, 108 
variance of, 108 


r in simple linear regression, 133-134 
R² (squared multiple correlation), 
161-164, 254-257 
effect of each variable on R², 262-265 
fixed x’s, 161-164 
adjusted R², 162 
angle between two vectors, 163 
properties of R² and R, 162 
random x’s, 254-257 
population multiple correlation, 254 
properties, 255 
sample multiple correlation, 256 
properties, 256-257 
Random matrix, 69 
Random model, 480 


Random variable(s), 69 
correlation, 74 
covariance, 71 
and independence, 71-74 
expected value (mean), 70 
independent, 71, 94 
mean (expected value), 70 
standard deviation, 71 
variance, 70 
Random vector(s), 69-74 
correlation matrix, 77-78 
covariance matrix, 75-76, 83 
linear functions of, 79-83 
mean of, 80 
variances and covariances of, 
81-83 
mean vector, 75-76 
partitioned, 78-79 
Random x’s in regression. See Regression, 
random x’s 
Rank of a matrix, 19-21 
full rank, 19 
rank of product, 20-21 
Regression coefficients (β’s), 2, 
138, 251 
partial regression coefficients, 138 
standardized coefficients (beta weights), 
251 
Regression, logistic. See Logistic regression 
Regression, multiple linear with fixed x’s, 
2-3, 137-184 
assumptions, 138-139 
centered x’s, 154-157 
coefficients. See Regression coefficients 
confidence interval(s) 
for β, 209 
for E(y), 211-212 
for one a′β, 211 
for one βⱼ, 210-211 
for σ², 215 
for several aᵢ′β’s, 216-217 
for several βⱼ’s, 216 
design matrix, 138 
diagnostics, 227-238. See also 
Diagnostics, regression 
estimation of β₀, β₁, ..., βₖ, 141-145 
with centered x’s, 154-157 
least squares, 2, 143-144 
maximum likelihood, 
158-159 
properties of estimators, 
145-149 
with sample covariances, 157 


estimation of σ² 
maximum likelihood estimator, 
158-159 
minimum variance unbiased 
estimator, 158-159 
unbiased estimator, 149-151 
best quadratic unbiased estimator, 
151 
generalized least squares, 164-169 
minimum variance estimators, 
158-159 
misspecification of error structure, 
151-153 
misspecification of model, 169-174. 
See also Misspecification of model 
model, 137-140 
multiple correlation (R), 161-162 
normal equations, 141-142 
orthogonal x’s, 149, 174-178 
orthogonalizing the x’s, 174-178 
outliers, 232-235. See also Outliers 
partial regression, 141 
prediction. See Prediction 
prediction equation, 142 
prediction interval, 213-215 
properties of estimators, 145-149 
purposes of, 2-3 
random x’s. See Regression, 
random x’s 
residuals, 227-230. See also Residuals 
sufficient statistics, 159-160 
tests of hypotheses 
all possible a′β, 193-194 
expected mean squares, 
173-174 
general linear hypothesis test 
H₀: Cβ = 0, 198-203 
estimation under reduced model, 
324-326 
full and reduced model, 324-326 
H₀: Cβ = t, 203-204 
likelihood ratio tests, 217-221 
distribution of likelihood ratio, 
218-219 
likelihood ratio, 218 
for H₀: β = 0, 219-220 
for H₀: Cβ = 0, 220-221 
linear combination a′β, 
204-205 
one βⱼ, 204-205 
F-test, 204-205 
t-test, 205 
overall regression test, 185-189 


in terms of R², 196-198 
several aᵢ′β’s, 205 
several βⱼ’s 
Bonferroni method, 206-207 
experimentwise error rate, 206 
overall α-level, 206 
Scheffé method, 207-209 
subset of the β’s, 189-196 
expected mean squares, 193, 196 
full and reduced model, 190 
noncentrality parameter, 192-193 
quadratic forms, 190-193, 195 
in terms of R², 196 
weighted least squares, 168 
X matrix, 138-139 
Regression, nonlinear. See Nonlinear 
regression 
Regression, Poisson. See Poisson regression 
Regression, random x’s, 243-273 
multivariate normal model, 244 
confidence intervals, 258-262 
estimation of β₀, β₁, and σ², 245-249 
properties of estimators, 249 
standardized coefficients (beta 
weights), 251 
in terms of correlations, 249-254 
R², 254-257. See also R², random 
x’s 
effect of each variable on R², 
262-265 
tests of hypotheses, 258-262 
comparison with tests for fixed x’s, 
258 
correlations, tests for, 260-261 
Fisher’s z-transformation, 261 
likelihood ratio tests, 258-260 
nonnormal data, 265-266 
estimation of β₀ and β₁, 266 
sample partial correlations, 266-273 
maximum likelihood estimators, 268 
other estimators, 269-271 
Regression, simple linear (one x), 1, 
127-136 
assumptions, 127 
coefficient of determination r², 
133-134 
confidence interval for β₀, 134 
confidence interval for β₁, 132-133 
correlation r, 133-134 
in terms of angle between 
vectors, 135 
estimation of β₀ and β₁, 128-129 
estimation of σ², 131-132 


model, 127 
properties of estimators, 131 
test of hypothesis for β₀, 119 
test of hypothesis for β₁, 132-133 
test of hypothesis for ρ, 134 
Regression sum of squares. See SSR 
Regression to the mean, 498 
Residual(s), 131, 227-230 
deleted residuals, 234 
externally studentized residual, 234 
hat matrix, 228, 230-232 
in linear mixed models, 501-502 
plots of, 230 
properties of, 227-230 
residual sum of squares (SSE), 131, 
150-151. See SSE 
studentized residual, 233 
Response variable, 1, 137, 150 
Robust estimation methods, 232 


Sample mean 
definition, 105-106 
independent of sample variance, 
119-120 
Sample space (data space), 152-153 
Sample variance (s²), 107-108 
best quadratic unbiased estimator, 151 
distribution, 118 
expected value, 108, 127 
independent of sample mean, 120 
Satterthwaite, 494 
Scalar, 6 
Scientific method, 1 
Selection of variables, 2, 172 
Serial correlation, 479 
Shrinkage estimator, 287, 500 
Significance level (α), 132 
Simple linear regression. See Regression, 
simple linear 
Singular matrix, 22 
Small sample inference for mixed linear 
models, 491, 494-497 
Span, 153 
Spectral decomposition, 51, 
495-496 
Square root matrix, 53 
SSE (error sum of squares) 
balanced ANOVA 
one-way model, 343-344 
two-way model, 385, 390-391 
independence of SSR and SSE, 187 
multiple regression, 150-156, 179 
non-full-rank model, 313-314 


simple linear regression, 131-132 
unbalanced ANOVA 
one-way model, 417 
two-way model 
constrained, 428 
unconstrained, 432 
SSH (for general linear hypothesis test) 
in ANOVA, 326-329, 348-351, 
401-403 
in regression, 199, 203 
SSR (regression sum of squares), 133-134, 
161, 164, 186-189 
Standardized distance, 77 
Subspace, 153, 317 
Sufficient statistics, 159-160 
Sum(s) of squares 
Analysis of covariance, 449-463, 
468-473 
ANOVA, balanced 
one-way, 345-346, 348-351 
contrasts, 358-363, 367-371 
two-way, 388-395, 395-403 
ANOVA, unbalanced 
one-way, 417 
contrasts, 417-421 
two-way, 426, 431-432 
full-and-reduced-model test in ANOVA, 
324-326 
SSE. See SSE 
SSH (for general linear hypothesis test). 
See SSH 
SSR (for overall regression test). See SSR 
as quadratic form, 105-107 
test of a subset of B’s, 190-192 
Symmetric matrix, 7 
Systems of equations, 28-32 
consistent and inconsistent, 29 
and generalized inverse, 37-39 


t-Distribution, 116-117, 123 
central t, 117 
noncentral t, 116-117, 132 
noncentrality parameter, 116-117, 132 
p-value. See p-Value 
t-Tests, 123, 131-132, 134, 205 
p-value. See p-Value 
Tests of hypotheses. See also Analysis 
of variance, testing hypotheses; 
One-way model (balanced), testing 
the hypothesis H₀: μ₁ = μ₂ = ⋯ = 
μₖ; Two-way model (balanced), tests 
of hypotheses 
for β₁ in simple linear regression, 
131-132 


in Bayesian regression, 286 
F-tests. See F-Tests 
general linear hypothesis test, 198-204 
for individual β’s or linear combinations. 
See Regression, multiple linear with 
fixed x’s, tests of hypotheses 
likelihood ratio tests, 217-221 
in linear mixed models, 491, 495 
overall regression test, 185-189, 196 
for p in bivariate normal distribution, 134 
regression tests in terms of R², 
196-198 
significance level (α), 132 
subset of the β’s, 189-196 
t-tests. See t-Tests 
Trace of a matrix, 44-46 
Transpose, 7 
Treatments, 4, 295, 339, 377 
Triangular matrix, 8 
Two-way model (balanced), 3, 
299-301, 377-408 
estimable functions, 378-382 
estimates of, 382-384 
interaction terms, 380 
main effect terms, 380-381 
estimation of σ², 384-385 
expected mean squares, 403-408 
quadratic form approach, 405 
sums of squares approach, 403-405 
interaction, 301, 377 
model, 377-378 
assumptions, 378 
no-interaction model, 329-335 
estimable functions, 330-331 
testing a hypothesis, 331-333 
normal equations, 382-384 
orthogonality of columns of X, 
333-335 
reparameterization, 299-300 
side conditions, 300-301, 381 
SSE, 384, 390 
tests of hypotheses 
interaction 
full-and-reduced-model test, 
388-391 
generalized inverse approach, 
391-395 
hypothesis, 385-388 
main effects 
full-and-reduced-model approach, 
395-401 
general linear hypothesis approach, 
401-403 
hypothesis, 396 


Unbalanced data in ANOVA 
cell means model, 414 
one-way model, 415-421 
contrasts, 417-421 
conditions for independence, 418 
orthogonal contrasts, 418 
weighted orthogonal contrasts, 419 
estimation, 415-416 
SSE, 416 
testing H₀: μ₁ = μ₂ = ⋯ = μₖ, 416 
overparameterized model, 414 
serial correlation, 479 
two-way model, 421-432 
cell means model, 421, 422 
constrained model, 428-432 
estimation, 430 
model, 429 
SSE, 431 
testing hypotheses, 431-432 
type I, II and III sums of squares, 414 
unconstrained model, 421-428 
contrasts, 424-425 
estimator of σ², 423 
Hadamard product, 425 
SSE, 423 
testing hypotheses, 425-428 
two-way model with empty 
cells, 432-439 
estimability of empty cell means, 435 
estimation for the partially 
constrained model, 434 
isolated cells, 432 
missing at random, 432 
testing the interaction, 433-434 
SSE, 433 
weighted squares of means, 414 
Underfitting, 170-172 


Validation of model, 227-238. See also Hat 
matrix; Influential observations; 
Outliers; Residual(s) 

Variable(s) 

dependent, 1, 137 

independent, 1, 137 

predictor, 1, 137 

response, 1, 137 

selection of variables, 2, 172 
Variance 

of estimators of λ′β, 311 

generalized, 77 

of least squares estimators, 130-131 

population, 70-71 

of quadratic form, 107 

sample, 95. See also Sample variance 


Variance components, 480 
estimating equations, 488 
estimation, 486-489 
Vector(s) 
angle between two vectors, 41-42, 136, 
163, 238 
column vector, 6 
linear independence and dependence, 19 
normalized vector, 42 
notation, 6 
orthogonal vectors, 37 
orthonormal vectors, set of, 38 
product of, 10-11 
random vector. See Random vector(s) 
row vector, 6 
zero vector (0), 8 


Weighted least squares, 168 


Zero matrix (O), 8 
Zero vector (0), 8 


